
Remarks 

Claims 1-16 are currently active. 

The Examiner has objected to Claims 2-7 and 9-15. 

The Examiner has rejected Claims 1, 8 and 16 as being unpatentable over 
Fatchi in view of Oren. Applicant respectfully traverses this rejection. 

Referring to Fatchi, there is disclosed a cross-connecting optical translator
array. Fatchi teaches that optical cross-connect switches should be used with optical
communications systems so as to maintain the speed and bandwidth advantages of using optical
fiber. However, a disadvantage of optical cross-connect switches, using current
technology, is that they are not practical when the number of input and output ports is large.
A further drawback of optical cross-connect switches arises when the optical transmission
systems of the transmit and receive terminals and the transport medium use different wavelength
standards to carry optical signals. See column 1, lines 27-56.

Fatchi teaches an optical communication system 100 having an input optical
communications system 110 for transmitting optical signals through an optical cross-connect
translator 150 that includes a cross-connect optical translator array 155. Optical cross-connect
translator 150 directs the optical signals to optical communications systems 160 and
165. The operating wavelengths of optical communication systems 110, 160 and 165 can be the
same or different.

Operationally, optical signals 171-175 are multiplexed by optical multiplexer
120 using wavelength division multiplexing techniques. The resulting multiplexed optical signal
is transported through multiple optical amplifiers 130 until reaching optical cross-connect translator
150. Prior to entering optical cross-connect translator 150, the multiplexed optical signals are
demultiplexed by optical demultiplexer 140 and transmitted to corresponding input ports of
optical cross-connect translator 150. The optical signals are switched and translated as
required to the optical transmission systems 160 and 165, which receive and transport the output
signals.

Cross-connecting optical translator array 155 includes N broadband optical-to-electrical
converters 201-210 that convert N optical signals to N electrical signals. Broadband
devices are used so that an optical signal of any wavelength can be received and processed by optical
translator array 155. Each of the N converters 201-210 is coupled to an input port of an N x
N non-blocking, fully connected electronic space switch fabric 220. A cross-connect
controller 250 is used to control the mapping of the connections between the input and output
ports of electronic space switch fabric 220. The output ports of electronics-based switch fabric
220 are coupled to N optical transmitters 231-240. The N optical transmitters are selected to
generate any of the standard wavelengths. Converters 201-210 and transmitters 231-240
operate as an optical translator that receives optical signals of a first wavelength and
transmits optical signals at a second wavelength, if required. As such, the invention of Fatchi is
said to be particularly well-suited for multiple vendor environments, where each vendor may require
transmission at different wavelengths. See column 3, lines 1-56.

It should be noted that nowhere in the teachings of Fatchi is there any mention 
or hint of the limitation "the port card having a mechanism for tolerating whether any one of 
the plurality of fabrics has a failure and still sending correct packets to the network", as found 
in Claim 1.

Referring to Oren, there is disclosed an ATM switching fabric. Oren teaches 
that an object of the present invention is to provide an ATM switching fabric incorporating 
redundancy in its internal architecture. Oren also teaches another object of the present 
invention is to provide control protocols for communications between modules internal to the 
ATM switching fabric. See column 3, lines 5-11. 

Oren teaches that a switch 10 has input port interface modules 50 which feed
input buses, top bus 90 and bottom bus 94. An output port interface module 51 is connected
to two output buses, left bus 92 and right bus 96. The junction modules 52 are attached to both
input buses and both output buses. Output module 54 has dedicated control signal lines 108
connected to each junction module 52. Each input port interface module 50 and output port
interface module 51 is implemented as a single component. See column 7, lines 45-57.

Output bus contention, inherent in the matrix architecture, is resolved by cell
buffering at each intersection of input and output buses and by an arbitration mechanism which
schedules the servicing of the various cell buffer pools located at each intersection and attached
to each output bus. Each set of four adjacent intersections is managed by a single junction
module 52. Alternatively, each of the four intersections may be managed by a separate junction module.
All junction modules attached to the same output bus report to and are scheduled for
service by the same output module 54. Thus, each fabric column 61 in the switch fabric 150
includes one output module 54 and four junction modules 52. See column 8, lines 7-20.

Oren teaches that four fabric columns 61 make up switching fabric 150, and that each fabric
column is constructed as a separate printed circuit board called a fabric card. The switching
fabric is modular because any number of fabric cards can be inserted to construct a switch of
any desired size. The four fabric cards, one for each fabric column, are connected together by
a back plane. The modularity of switching fabric 150 allows for the addition of a redundant
fifth fabric card. While all fabric cards are operating normally, the redundant fabric card
remains inactive and does not participate in the traffic flow. However, when a management
system within switch 10 detects a failure on one of the fabric cards, it configures the redundant
fabric card the same as the malfunctioning fabric card, takes the malfunctioning fabric card out
of service and activates the redundant fabric card. See column 19, lines 35-53.

It is black letter patent law that for the teachings of references to be combined,
there must be a teaching in the references themselves to combine them. The Examiner
relies upon the combination of Fatchi and Oren to find that Claim 1 is obvious. In regard to
Fatchi and Oren, there is no teaching or suggestion in the references themselves to combine
their respective teachings. This follows because Fatchi is totally
silent about any type of redundant fabric; its object is to provide a cross-connecting
optical translator array suited to multiple vendor environments, where each vendor may
require transmission at different wavelengths. See column 3, line 55 and column 1, lines 47-55.
In contrast, it is the object of Oren to provide an ATM switching fabric incorporating
redundancy in its internal architecture, and to provide control protocols for communications
between modules internal to the ATM switching fabric. See column 3, lines 5-10.
Accordingly, for this reason alone, that there is no teaching or suggestion in the references
themselves to combine the respective teachings the Examiner relies upon to find obviousness
with regard to Claim 1, Claim 1 is patentable over the combination of Fatchi and Oren.

Furthermore, it is black letter law that teachings cannot be taken out of the
context in which they are found. Fatchi teaches an optical switch that
simply speaks to signals and that requires a cross-connecting optical translator array 155. In
contrast, Oren teaches an ATM switching fabric which transfers packets and requires input
port interface modules, output port interface modules and junction modules. There is no hint
or suggestion of how to combine these two different technological contexts so that the optical
switch taught by Fatchi could be modified, without undue experimentation or guesswork, to
include a redundant fabric, such that the entire architecture of
the switch taught by Fatchi diverts the optical signals when
one of its fabrics fails. In fact, it is respectfully submitted that one skilled in the art
would require significant development and research to modify the
teachings of the optical switch taught by Fatchi to include a redundant fabric that can be utilized
when one of the other fabrics fails.

Consistent with the law that teachings cannot be taken out of the context in
which they are found is the requirement that the Examiner cannot use hindsight to arrive at
applicant's claimed invention. Here, as already explained above, there is no teaching or
suggestion in the references themselves to combine their respective teachings. The only
reason to combine these references comes from the hindsight of using applicant's claim as a road
map to find the different elements of the claim in different prior art references and, having
found the elements themselves in the different references, concluding that Claim 1 is arrived at.
However, again, it is respectfully submitted that this is not patent law. For this reason too,
that the reason to combine these references arises only from hindsight, Claim 1 is patentable
over the prior art of record.

Claims 8 and 16 are patentable for the same reasons Claim 1 is patentable.
Furthermore, Claim 8 has the limitations of "sending to fabrics of the switch portions of the
packets as stripes from the port card" and "sending back to the port card the portions of the
packets as stripes from the fabrics". Nowhere in the teachings of Fatchi or Oren is there any
reference to sending portions of the packets as stripes anywhere. For this reason, Claims 8
and 16 are patentable over Fatchi in view of Oren.

The Examiner's attention is brought to somewhat related application serial 
number 09/333,450. 

A substitute clean specification and marked up original specification are 
enclosed. The marked original specification has deletions bracketed and additions underlined. 
No new matter has been added. The information deleted is unnecessary for enablement and is
considered superfluous information that applicant desires not to have published. 

In view of the foregoing amendments and remarks, it is respectfully requested
that the outstanding rejections and objections to this application be reconsidered and
withdrawn, and that Claims 1-16, now in this application, be allowed.



Respectfully submitted, 





JEFF SCHULZ 



Attorney for Applicant 

RECEIVER DECODING ALGORITHM TO ALLOW HITLESS
N+1 REDUNDANCY IN A SWITCH



FIELD OF THE INVENTION 



The present invention is related to a switch having
fabrics that can recover from the failure of a single fabric. More
specifically, the present invention is related to a switch having
fabrics that can recover from the failure of a single fabric with
the aid of a check sum that is added to parity data
striped onto the fabrics of the switch.

NOV 2 8 2003 

BACKGROUND OF THE INVENTION 

A switch which stripes data onto multiple fabrics and
sends parity data to another fabric has been described in U.S. 
patent application serial number 09/333,450, incorporated by 
reference herein. See also U.S. patent application serial number 
09/293,563 which describes a wide memory TDM switching system, 
incorporated by reference herein. The present invention describes 
the receiver algorithm used in a data communications system which 
allows for detection/recovery of data and close to hitless recovery 
of a single element in a switch. 

The present invention allows a switch that utilizes 
striping to tolerate a single hardware failure without requiring a 
change-over time. Conventional redundancy systems require 
detection of bad data and reconfiguration of the switch to an 
alternate source of data. The time between the failure to the 
successful reconfiguration of the system results in lost data which 
impacts traffic going through the switch. The technique of the 
present invention will have only a small portion of the traffic
affected.

SUMMARY OF THE INVENTION 

The present invention pertains to a switch of a network
for switching packets. The switch comprises a plurality of fabrics
which switch portions of packets. The switch comprises a port card
connected to the fabrics and the network for receiving packets from
and sending packets to the network. The port card has a mechanism
for tolerating whether any one of the plurality of fabrics has a
failure and still sending correct packets to the network.

The present invention pertains to a method for switching
packets. The method comprises the steps of receiving packets at a
port card from a network of a switch. Then there is the step of
sending to fabrics of the switch portions of the packets as stripes
from the port card. Next there is the step of switching the
portions of the packets with the fabrics. Then there is the step
of sending back to the port card the portions of the packets as
stripes from the fabrics. Next there is the step of sending
correct packets with the port card to the network even though one
of the fabrics has a failure.

BRIEF DESCRIPTION OF THE DRAWINGS

In the accompanying drawings, the preferred embodiment of
the invention and preferred methods of practicing the invention are
illustrated in which:

Figure 1 is a schematic representation of packet striping
in the switch of the present invention.

Figure 2 is a schematic representation of an OC 48 port
card.

Figure 3 is a schematic representation of a concatenated
network blade.

Figure 4 is a schematic representation regarding the
connectivity of the fabric ASICs.

[[Figure 5 is a schematic representation of a 32 bit cell
transfer.]]

[[Figure 6 is a schematic representation regarding
back pressure.]]

[[Figure 7 is a schematic representation of a 32 bit packet
transferred using external connection number bus.]]

[[Figure 8 is a schematic representation of a 64 bit cell
transferred.]]

[[Figure 9 is a schematic representation of a 64 bit packet
transfer.]]

[[Figure 10 is a schematic representation of ATM cell flow
in the switch.]]

Figure [[11]] 5 is a schematic representation of sync
pulse distribution.

[[Figure 12 is a schematic representation regarding the
write cycle.]]

[[Figure 13 is a schematic representation of the read
cycle.]]

[[Figure 14 is a schematic representation of the striper
ASIC architecture.]]

[[Figure 15 is a schematic representation of the aggregator
ASIC architecture.]]

[[Figure 16 is a schematic representation of a memory
controller ASIC architecture.]]

[[Figure 17 is a schematic representation of the wide cache
line shared memory architecture.]]



[[Figure 18 is a schematic representation of a separator
ASIC architecture.]]

[[Figure 19 is a schematic representation of an unstriper
ASIC architecture.]]

Figure [[20]] 6 is a schematic representation regarding
the relationship between transmit and receive sequence counters for
the separator and unstriper, respectively.

[[Figure 21 is a schematic representation of a receive
synchronizer.]]

Figure [[22]] 7 is a schematic representation of a switch 
of the present invention. 

DETAILED DESCRIPTION

Referring now to the drawings wherein like reference
numerals refer to similar or identical parts throughout the several
views, and more specifically to figure [[22]] 2 thereof, there is
shown a switch 10 for switching packets. The switch 10 comprises
a plurality of fabrics 14 which switch portions of packets. The
switch 10 comprises a port card 12 connected to the fabrics 14 and
the network 11 for receiving packets from and sending packets to
the network 11. The port card 12 has a mechanism 16 for tolerating
whether any one of the plurality of fabrics 14 has a failure and
still sending correct packets to the network 11.

Preferably, the plurality of fabrics 14 includes n
fabrics 14 which receive from and send to the port card 12 portions
of packets, where n is greater than or equal to 2 and is an
integer, where one of the fabrics is a parity fabric 18 which sends
to and receives from the port card 12 parity data regarding the
packets. The port card 12 preferably has a striper 20 which sends
portions of packets as stripes to the n fabrics 14 to which they
correspond, and which calculates a checksum of the packet and adds
it to the packet before it is striped.

Preferably, the port card 12 has an unstriper 22 which
receives the stripes and parity data from the fabrics 14,
calculates the parity data from the stripes received, and compares
the parity data received with the parity data calculated to
determine if one of the fabrics 14 has failed.

If the unstriper 22 has determined that one of the n fabrics 14
has failed, the unstriper 22 preferably calculates a checksum for
each fabric 14, replacing the data from each fabric in turn,
compares each calculated checksum to the checksum received with the
packet, which was calculated before the packet was striped, and
recovers the stripe from the fabric that has failed from the other
stripes. Preferably, the checksum is 16 bits. Each fabric
preferably has an aggregator 24 which receives the stripes from the
port card 12, a memory controller 26 in which the stripes are
stored and a separator 28 which sends the stripes back to the port
card 12.

The present invention pertains to a method for switching
packets. The method comprises the steps of receiving packets at a
port card 12 from a network 11 of a switch 10. Then there is the
step of sending to fabrics 14 of the switch 10 portions of the
packets as stripes from the port card 12. Next there is the step
of switching the portions of the packets with the fabrics 14. Then
there is the step of sending back to the port card 12 the portions
of the packets as stripes from the fabrics 14. Next there is the
step of sending correct packets with the port card 12 to the
network 11 even though one of the fabrics 14 has a failure.

Preferably, the sending to fabrics 14 of the switch step
includes the step of sending to n respective fabrics 14 n stripes
of portions of the packets, where n is greater than or equal to 2
and is an integer and where one of the stripes is a parity stripe
having parity data concerning the packet, which is sent to a parity
fabric 18. Before the sending the n stripes step there is
preferably the step of calculating a check sum of the packet with a
striper 20 and adding it to the packet before it is striped.

Preferably, the sending back to the port card 12 step
includes the step of receiving at an unstriper 22 of the port card
12 the stripes and parity stripe from the fabrics 14, calculating
with the unstriper 22 the parity data from the stripes received,
and comparing the parity data received from the parity stripe with
the parity data calculated by the unstriper 22 to determine if one
of the fabrics 14 has failed. After the comparing step, there is
preferably the step of calculating with the unstriper 22 the check
sum, replacing the data from each fabric in turn, comparing the
calculated check sum for each fabric to the check sum received with
the packet calculated before the packet is striped, identifying
which fabric has failed, and recovering the stripe from the fabric
that has failed from the other stripes.

The switching step preferably includes the step of
receiving the portions of the packets as stripes at an aggregator
24 of the fabric, and storing the portions of the packets in a
memory controller 26 of the fabric. Preferably, the sending back
to the port card 12 step includes the step of sending with a
separator 28 of the fabric the portions of packets in the memory
controller 26 as stripes back to the unstriper 22 of the port card
12.



In the operation of the invention, the receive algorithm
is designed to tolerate one hardware failure and recover from it.
Before data is striped, a 16-bit checksum is added to the packet,
and is part of the striped data. The resulting data then has
parity calculated similar to a RAID 5 disk array.
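
For illustration only, the transmit side just described might be sketched in Python as follows. This is a minimal sketch under stated assumptions: byte-wide round-robin striping, zero padding, and the standard-library CRC-CCITT routine standing in for the unspecified 16-bit checksum. None of these choices, nor the name stripe_packet, is taken from the specification.

```python
import binascii
from functools import reduce
from typing import List


def stripe_packet(packet: bytes, n_data: int) -> List[bytes]:
    """Hypothetical striper: checksum, split into n_data stripes, add parity."""
    # Assumed checksum: 16-bit CRC-CCITT (the text only says "16-bit checksum").
    check = binascii.crc_hqx(packet, 0)
    framed = packet + check.to_bytes(2, "big")
    # Assumed zero padding so the framed data divides evenly across stripes.
    framed += bytes(-len(framed) % n_data)
    # Byte i goes to data stripe i mod n_data (round-robin, an assumption).
    stripes = [framed[i::n_data] for i in range(n_data)]
    # Parity stripe: byte-wise XOR of the data stripes, as in a RAID 5 array.
    parity = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*stripes))
    return stripes + [parity]
```

Calling stripe_packet(b"hello world!", 3) would yield three data stripes and one parity stripe of equal length, one per fabric.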

From a receiver perspective, the data comes in from N
fabrics 14 (where N ranges from 2 to 13). The data is parity checked
to see if any fabric is bad. Note that the parity check does not
identify the source of the bad data, only that some data is not
correct.



If some data is not correct, the receiver then builds a
candidate data stream for each fabric, assuming in turn that that
fabric's input stream could be bad. Each of these candidate
reconstruction streams is then run through the checksum algorithm.
There are three possible results from the N checksum checks.



     a.   No checksum may pass. Note that this cannot happen
          in a single fabric failure scenario. If a single
          fabric fails, then one of the N checksums removes
          the failing fabric, so that checksum will pass.
          The data is dropped, and can be logged as a
          multiple failure, although the failure locations
          cannot be determined.

     b.   A single checksum may pass. A single failure
          occurred and can be identified as the fabric which
          was removed from the successful calculation. All
          other checksum calculations which use the corrupted
          data failed. This data is recovered.

     c.   More than one checksum may pass. If more than one
          checksum passes, both the stream reconstructed
          without the error and at least one of the
          reconstructed streams with errors passed. In this
          case, the traffic must be dropped.
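
The receive-side decision described above can be sketched as follows. This is a hedged illustration rather than the actual ASIC logic: it assumes byte-wise round-robin stripes with no padding, the parity stripe in the last position, and CRC-CCITT as a stand-in for the 16-bit checksum. The function names are hypothetical.

```python
import binascii
from functools import reduce
from typing import List, Optional, Tuple


def xor_stripes(stripes: List[bytes]) -> bytes:
    """Byte-wise XOR across equally long stripes."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*stripes))


def checksum_ok(data_stripes: List[bytes]) -> bool:
    """Re-interleave the data stripes and verify the trailing 16-bit checksum."""
    n = len(data_stripes)
    total = n * len(data_stripes[0])
    framed = bytes(data_stripes[i % n][i // n] for i in range(total))
    return binascii.crc_hqx(framed[:-2], 0) == int.from_bytes(framed[-2:], "big")


def receive(stripes: List[bytes]) -> Tuple[str, Optional[int]]:
    """stripes = data stripes followed by one parity stripe (assumed order)."""
    if not any(xor_stripes(stripes)):
        return ("ok", None)                  # parity check found nothing wrong
    passing = []
    for k in range(len(stripes)):
        # Candidate stream assuming fabric k is bad: rebuild it from the rest.
        candidate = list(stripes)
        candidate[k] = xor_stripes(stripes[:k] + stripes[k + 1:])
        if checksum_ok(candidate[:-1]):
            passing.append(k)
    if len(passing) == 1:
        return ("recovered", passing[0])     # case b: failed fabric identified
    return ("dropped", None)                 # case a (none) or case c (several)
```

A return value of ("recovered", k) corresponds to case b, with k the index of the fabric whose data was replaced; cases a and c both end in the data being dropped.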

If a reasonably strong checksum is used, then the
probability of random error data passing is bounded by 1/2^n, where
n is the number of bits in the checksum. For BFS, n=16, which
gives a best case bound of a single packet having a probability of
(1/65536) * number of fabrics of being recovered.

The identification of bad fabrics cannot be concluded
from either a or c in a single packet. If statistics are collected
over multiple packets, the errors will cluster around all the
calculations which involve data from the failing fabric. Across
multiple packets, one lane should consistently have no errors,
allowing the error free lane to be identified in a reasonably small
number of packets. The probability of the failing fabric remaining
unidentified is ((1/65536) ^ number of packets) * number of
fabrics.

If the number of packets is set to 4 or more, the
resulting probability of not identifying an error is less than 5 *
10^-20.
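
The two bounds quoted above can be evaluated directly. The short sketch below simply restates them in Python with exact fractions; the function names are illustrative and not part of the specification.

```python
from fractions import Fraction

CHECKSUM_BITS = 16  # n = 16 in the text


def false_pass_bound(num_fabrics: int) -> Fraction:
    """(1/65536) * number of fabrics, the single-packet bound."""
    return Fraction(num_fabrics, 2 ** CHECKSUM_BITS)


def unidentified_bound(num_packets: int, num_fabrics: int) -> Fraction:
    """((1/65536) ^ number of packets) * number of fabrics."""
    return Fraction(1, 2 ** CHECKSUM_BITS) ** num_packets * num_fabrics
```

For four packets the per-fabric term (1/65536)^4 equals 2^-64, roughly 5.4 * 10^-20, which is the order of magnitude quoted above before the multiplication by the number of fabrics.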

The switch uses RAID techniques to increase overall
switch bandwidth while minimizing individual fabric bandwidth. In
the switch architecture, all data is distributed evenly across all
fabrics so the switch adds bandwidth by adding fabrics and the
fabric need not increase its bandwidth capacity as the switch
increases bandwidth capacity.

Each fabric provides 40G of switching bandwidth and the
system supports 1, 2, 3, 4, 6, or 12 fabrics, exclusive of the
redundant/spare fabric. In other words, the switch can be a 40G,
80G, 120G, 160G, 240G, or 480G switch depending on how many fabrics
are installed.

A portcard provides 10G of port bandwidth. For every 4
portcards, there needs to be 1 fabric. The switch architecture
does not support arbitrary installations of portcards and fabrics.
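
The sizing rules above reduce to simple arithmetic, sketched below. The constants come from the text (40G per fabric, 10G per portcard, legal fabric counts of 1, 2, 3, 4, 6, or 12); the function name and the returned dictionary keys are hypothetical.

```python
FABRIC_GBPS = 40
PORTCARD_GBPS = 10
LEGAL_FABRIC_COUNTS = (1, 2, 3, 4, 6, 12)  # exclusive of the spare fabric


def switch_capacity(num_fabrics: int) -> dict:
    """Bandwidth and portcard count served by a legal fabric configuration."""
    if num_fabrics not in LEGAL_FABRIC_COUNTS:
        raise ValueError(f"{num_fabrics} fabrics is not a supported configuration")
    bandwidth = num_fabrics * FABRIC_GBPS
    # One fabric is needed for every 4 portcards (4 * 10G = 40G).
    return {"bandwidth_gbps": bandwidth,
            "portcards": bandwidth // PORTCARD_GBPS}
```

For example, switch_capacity(3) describes the 120G switch, which serves 12 portcards.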

The fabric ASICs support both cells and packets. As a
whole, the switch takes a "receiver make right" approach where the
egress path on ATM blades must segment frames to cells and the
egress path on frame blades must perform reassembly of cells into
packets.

There are currently eight switch ASICs that are used in 
the switch: 



     1.   Striper - The Striper resides on the portcard and
          SCP-IM. It formats the data into a 12 bit data
          stream, appends a checkword, splits the data stream
          across the N non-spare fabrics in the system,
          generates a parity stripe of width equal to the
          stripes going to the other fabrics, and sends the
          N+1 data streams out to the backplane.

     2.   Unstriper - The Unstriper is the other portcard
          ASIC in the switch architecture. It receives
          data stripes from all the fabrics in the system. It
          then reconstructs the original data stream using
          the checkword and parity stripe to perform error
          detection and correction.

     3.   Aggregator - The Aggregator takes the data streams
          and routewords from the Stripers and multiplexes
          them into a single input stream to the Memory
          Controller.

     4.   Memory Controller - The Memory Controller
          implements the queueing and dequeueing mechanisms
          of the switch. This includes the proprietary wide
          memory interface to achieve the simultaneous
          en-/de-queueing of multiple cells of data per clock
          cycle. The dequeueing side of the Memory Controller
          runs at 80Gbps compared to 40Gbps in order to make
          the bulk of the queueing and shaping of connections
          occur on the portcards.

     5.   Separator - The Separator implements the inverse
          operation of the Aggregator. The data stream from
          the Memory Controller is demultiplexed into
          multiple streams of data and forwarded to the
          appropriate Unstriper ASIC. Included in the
          interface to the Unstriper is a queue and flow
          control handshaking.

     [[6.  Trident - Trident is, strictly speaking, not one of
          the ASICs. It is actually one-half of the Poseidon
          chipset. Trident will be used to implement the ATM
          portcards within the switch.]]

     [[7.  Vortex - Vortex is the partner to Trident in the
          Poseidon chipset. Vortex is the ingress ASIC and
          Trident the egress device. Together, the two chips
          implement a 2.5Gbps ingress, 5Gbps egress system
          capable of supporting up to OC 48c ports.]]

     [[8.  Reassembler - The Reassembler ASIC is the frame
          blade equivalent to Trident. It will be capable of
          taking cell streams from the Unstriper and
          converting them into frames.]]

There are 3 different views one can take of the
connections between the fabric: physical, logical, and "active."
Physically, the connections between the portcards and the fabrics
are all gigabit speed differential pair serial links. This is
strictly an implementation issue to reduce the number of signals
going over the backplane. The "active" perspective looks at a
single switch configuration, or it may be thought of as a snapshot
of how data is being processed at a given moment. The interface
between the fabric ASIC on the portcards and the fabrics is
effectively 12 bits wide. Those 12 bits are evenly distributed
("striped") across 1, 2, 3, 4, 6, or 12 fabrics based on how the
fabric ASICs are configured. The "active" perspective refers to the
number of bits being processed by each fabric in the current
configuration which is exactly 12 divided by the number of fabrics.

The logical perspective can be viewed as the union or max
function of all the possible active configurations. Fabric slot #1
can, depending on configuration, be processing 12, 6, 4, 3, 2, or
1 bits of the data from a single Striper and is therefore drawn
with a 12 bit bus. In contrast, fabric slot #3 can only be used to
process 4, 3, 2, or 1 bits from a single Striper and is therefore
drawn with a 4 bit bus.
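
The "active" and logical widths described above can be computed as sketched below. The sketch assumes that a configuration with n fabrics occupies slots 1 through n; the function names are illustrative only.

```python
BUS_BITS = 12
LEGAL_FABRIC_COUNTS = (1, 2, 3, 4, 6, 12)


def active_width(num_fabrics: int) -> int:
    """Bits handled by each fabric in one configuration: 12 / number of fabrics."""
    if num_fabrics not in LEGAL_FABRIC_COUNTS:
        raise ValueError("unsupported fabric count")
    return BUS_BITS // num_fabrics


def logical_width(slot: int) -> int:
    """Maximum active width over every configuration that uses this slot.

    Assumes a configuration with n fabrics uses slots 1..n.
    """
    if slot < 1 or slot > max(LEGAL_FABRIC_COUNTS):
        raise ValueError("slot out of range")
    return max(active_width(n) for n in LEGAL_FABRIC_COUNTS if n >= slot)
```

Under that assumption logical_width(1) is 12 and logical_width(3) is 4, matching the bus widths given for fabric slots #1 and #3.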

Unlike previous switches, the switch really doesn't have
a concept of a software controllable fabric redundancy mode. The
fabric ASICs implement N+1 redundancy without any intervention as
long as the spare fabric is installed.

As to what it provides, N+1 redundancy means that
the hardware will automatically detect and correct a single failure
without the loss of any data.

The way the redundancy works is fairly simple, but to
make it even simpler to understand, a specific case of a 120G switch
is used which has 3 fabrics (A, B, and C) plus a spare (S). The
Striper takes the 12 bit bus and first generates a checkword which
gets appended to the data unit (cell or frame). The data unit and
checkword are then split into a 4-bit-per-clock-cycle data stripe
for each of the A, B, and C fabrics (A3 A2 A1 A0, B3 B2 B1 B0, and
C3 C2 C1 C0). These stripes are then used to produce the stripe for
the spare fabric S3 S2 S1 S0, where Sn = An XOR Bn XOR Cn, and these
4 stripes are sent to their corresponding fabrics. On the other
side of the fabrics, the Unstriper receives four 4-bit stripes from
A, B, C, and S. All possible combinations of 3 fabrics (ABC, ABS,
ASC, and SBC) are then used to reconstruct a "tentative" 12-bit
data stream. A checkword is then calculated for each of the 4
tentative streams and the calculated checkword compared to the
checkword at the end of the data unit. If no error occurred in
transit, then all 4 streams will have checkword matches and the ABC
stream will be forwarded to the Unstriper output. If a (single)
error occurred, only one checkword match will exist and the stream
with the match will be forwarded off chip and the Unstriper will
identify the faulty fabric stripe.
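
The 120G example can be followed at the level of a single 12-bit word. The sketch below is illustrative only: it assumes fabric A carries the most significant 4 bits, and it leaves out the checkword comparison that selects among the tentative streams. The function names are hypothetical.

```python
from typing import Dict, Tuple


def split_word(word12: int) -> Tuple[int, int, int, int]:
    """Split one 12-bit word into 4-bit stripes A, B, C and the spare S."""
    a = (word12 >> 8) & 0xF   # assumed: A carries the top nibble
    b = (word12 >> 4) & 0xF
    c = word12 & 0xF
    return a, b, c, a ^ b ^ c  # Sn = An XOR Bn XOR Cn


def tentative_words(a: int, b: int, c: int, s: int) -> Dict[str, int]:
    """Rebuild the 12-bit word from each combination of 3 of the 4 stripes."""
    def join(x: int, y: int, z: int) -> int:
        return (x << 8) | (y << 4) | z
    return {
        "ABC": join(a, b, c),
        "ABS": join(a, b, a ^ b ^ s),   # C rebuilt from A, B and S
        "ASC": join(a, a ^ s ^ c, c),   # B rebuilt from A, S and C
        "SBC": join(s ^ b ^ c, b, c),   # A rebuilt from S, B and C
    }
```

With no error all four tentative words agree. If the stripe from fabric B is corrupted, only the ASC combination, which does not use B, reproduces the original word.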

For different switch configurations, i.e. 1, 2, 4, 6, or 
12 fabrics, the algorithm is the same but the stripe width changes. 

If 2 fabrics fail, all data running through the switch
will almost certainly be corrupted.

[[There are basically two options, both requiring that the
defective fabrics be known through some means. Unfortunately, in
a double failure system, the hardware that detects and identifies
a failed fabric will only be able to identify the fabric that
failed first (if there was one). Identifying both the failed
fabrics may only be possible through a trial-and-error approach
unless the switch software and/or switch diagnostics can develop
tests to identify the second failure.]]

The recommended approach would be to shut down the switch
and install as many good fabrics as possible beginning with slot 1.
This allows the maximum bandwidth and redundancy to be available given
the functional hardware available.

The other option is to have the switch software reconfigure the
switch to use fewer fabrics. This is an inferior solution for two
reasons:

1. It can never provide more bandwidth than the recommended
approach.

2. It requires substantial thought and understanding of the switch
by the user in order to determine what is the maximum operational
configuration.

Basically, the user must start at fabric slot 1 and count the
number of operational fabrics. If the spare fabric is operational,
then it may be used to "cover" for the first non-operational
fabric.

Example #1: A redundant 240G switch (6+1 fabrics) has suffered
fabric failures in slots 3 and 4. Starting with slot 1 there are 2
operational fabrics and the spare is available to cover for slot 3.
This switch can be reconfigured to a 120G non-redundant switch or
an 80G redundant switch. Note that by swapping fabrics 5 and 6 into
slots 3 and 4, this switch could be a 160G redundant switch.

Example #2: A redundant 480G switch suffers fabric failures in
slot 1 and the spare. Start swapping fabrics. Slot 1 is dead and
the spare is not available to cover for it. This is the worst case
scenario.

Example #3: A redundant 480G switch suffers fabric failures in
slots 2 and 10. There is one functional fabric counting from slot
1, or 9 if the spare is used to cover for slot 2. This switch can be
configured either as 40G redundant or 240G non-redundant. Note that
fabrics 7, 8, and 9 do not help since the only legal configuration
after 6 fabrics is all 12.
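
The counting rule used in these examples can be sketched in a few lines. The rule coded here is inferred from the examples (count working fabrics from slot 1, optionally let the spare cover the first failed slot, then round down to a legal fabric count); the function names and the assumption of 40G per fabric with legal counts of 1, 2, 3, 4, 6 and 12 are illustrative and not an authoritative statement of the switch software.

```python
LEGAL_FABRIC_COUNTS = (1, 2, 3, 4, 6, 12)
GBPS_PER_FABRIC = 40

def usable_run(installed: int, failed: set, spare_covers: bool) -> int:
    """Count operational fabrics from slot 1, letting the spare cover one failure."""
    run, covered = 0, False
    for slot in range(1, installed + 1):
        if slot in failed:
            if spare_covers and not covered:
                covered = True       # spare stands in for the first failed slot
            else:
                break                # second failure ends the usable run
        run += 1
    return run

def largest_legal(run: int) -> int:
    legal = [n for n in LEGAL_FABRIC_COUNTS if n <= run]
    return max(legal) if legal else 0

def options(installed: int, failed: set, spare_ok: bool):
    """Return (redundant Gbps, non-redundant Gbps) reachable without moving fabrics."""
    redundant = largest_legal(usable_run(installed, failed, False)) if spare_ok else 0
    non_redundant = largest_legal(usable_run(installed, failed, spare_ok))
    return redundant * GBPS_PER_FABRIC, non_redundant * GBPS_PER_FABRIC
```

For instance, `options(6, {3, 4}, True)` gives the 80G redundant and 120G non-redundant choices of the 240G example, before any fabrics are swapped between slots.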

The fabric slots are numbered and must be populated in
ascending order. Also, the spare fabric is a specific slot so
populating fabric slots 1, 2, 3, and 4 is different than populating
fabric slots 1, 2, 3, and the spare. The former is a 160G switch
without redundancy and the latter is 120G with redundancy.

Firstly, the ASICs are constructed and the backplane
connected such that the use of certain portcard slots requires
there to be at least a certain minimum number of fabrics installed,
not including the spare. This relationship is shown in Table 0.

In addition, the APS redundancy within the switch is
limited to specifically paired portcards. Portcards 1 and 2 are
paired, 3 and 4 are paired, and so on through portcards 47 and 48.
This means that if APS redundancy is required, the paired slots
must be populated together.

To give a simple example, take a configuration with 2
portcards and only 1 fabric. If the user does not want to use APS
redundancy, then the 2 portcards can be installed in any two of
portcard slots 1 through 4. If APS redundancy is desired, then the
two portcards must be installed either in slots 1 and 2 or slots 3
and 4.



Portcard Slot    Minimum # of Fabrics
1-4              1
5-8              2
9-12             3
13-16            4
17-24            6
25-48            12

Table 0: Fabric Requirements for Portcard Slot Usage
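
As an illustration, the two population rules (minimum fabrics per portcard slot from Table 0, and APS pairing of slots 1/2, 3/4, and so on) can be expressed as a small check. The function names are hypothetical; only the slot ranges and the pairing rule come from the text.

```python
# (first slot, last slot, minimum fabrics not counting the spare), per Table 0
MIN_FABRICS_BY_SLOT = [
    (1, 4, 1), (5, 8, 2), (9, 12, 3), (13, 16, 4), (17, 24, 6), (25, 48, 12),
]

def min_fabrics_for_slot(slot: int) -> int:
    for first, last, fabrics in MIN_FABRICS_BY_SLOT:
        if first <= slot <= last:
            return fabrics
    raise ValueError(f"portcard slot {slot} is outside 1-48")

def min_fabrics(slots: set) -> int:
    """Fabrics needed so that every populated portcard slot is usable."""
    return max(min_fabrics_for_slot(s) for s in slots)

def aps_partner(slot: int) -> int:
    # Slots are paired (1,2), (3,4), ..., (47,48).
    return slot + 1 if slot % 2 else slot - 1

def aps_ready(slots: set) -> bool:
    """True when every populated slot has its APS partner populated too."""
    return all(aps_partner(s) in slots for s in slots)
```

A single portcard in slot 12 yields `min_fabrics({12}) == 3`, and two portcards in slots 1 and 4 work with one fabric but are not an APS pair.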

To add capacity, add the new fabric(s), wait for the
switch to recognize the change and reconfigure the system to stripe
across the new number of fabrics. Then install the new portcards.

Note that it is not technically necessary to have the
full 4 portcards per fabric. The switch will work properly with 3
fabrics installed and a single portcard in slot 12. This isn't cost
efficient but it will work.

To remove capacity, reverse the procedure for adding capacity.

The switch is oversubscribed when, for example, 8 portcards are
installed with only one fabric. This should only come about as the
result of improperly upgrading the switch or a system failure of
some sort. The reality is that one of two things will occur,
depending on how this situation arises. If the switch is configured
as a 40G switch and the portcards are added before the fabric, then
the 5th through 8th portcards will be dead. If the switch is
configured as an 80G non-redundant switch and the second fabric
fails or is removed, then all data through the switch will be
corrupted (assuming the spare fabric is not installed). And just to
be complete, if 8 portcards were installed in an 80G redundant
switch and the second fabric failed or was removed, then the switch
would continue to operate normally with the spare covering for the
failed/removed fabric.

The switch includes the following features: 

- Scales from 40Gbps to 480Gbps (40, 80, 120, 160, 240, 480
Gb/sec are the supported configurations).

- Switches ATM cells and variable-length packets.

- N+1 fabric redundancy with error detection and recovery
supported in the ASIC chipset.

- Native APS support.

20 Support up to 19GK cell shared memory, — 921 GK unicast and 

G4K multicast connections. 

- Supports 2x port speed for fabric dequeueing (2.5 Gb/sec in,
5 Gb/sec out for each OC48 port).

- Supports both OC48c ports and OC192c ports.

- Provides port/priority queuing similar to past switch
fabrics. Four priorities are provided for 40-120 Gb/sec
switches, 2 priorities/port for 240 Gb/sec switches and
1 priority for 480 Gb/sec switches.

- ASICs utilize 250 MHz HSTL point-to-point busses between
fabric ASICs and interface with the backplane using
standard Gbit transceivers.

- Interfaces to port card chips are 80-125 MHz LVTTL signals.

- Supports output port supplied back pressure.

The significant architectural difference between the switch and
past switches is that incoming traffic is routed to multiple switch
fabrics. Each fabric is designed to enqueue 40 Gb/sec of data and
dequeue 80 Gb/sec of data. As data comes into the switch, it is
broken up on a bit by bit basis and part of each packet is sent to
each fabric in the box. The fabrics will all make the same
enqueuing and drop decisions, and all schedule fragments of a
packet/cell at the same time. Each fabric sends its portion of the
packet or cell to the output port card, which reassembles the
fragments into the complete cell/packet, which is then passed to a
shared memory ASIC for per port storage and scheduling. The XOR of
the data sent to each fabric is sent to a spare fabric. In the
event of a fabric failure, that fabric's data can be recovered by
utilizing the good data bits and the parity fabric bits to
recalculate any fabric's data. The striping of data to fabrics
happens on the basis of 48 bit chunks. This allows the switch to
support 1, 2, 3, 4, 6 and 12 fabrics.

Five ASICs build the switching functionality for the switch. These
ASICs are described briefly below.
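
A short sketch shows why a 48 bit chunk suits every supported fabric count and how the spare fabric's XOR parity recovers a lost stripe. The contiguous slicing used here is an assumption made for readability (the text describes bit by bit striping without giving the exact bit mapping), and the function names are hypothetical.

```python
from functools import reduce

CHUNK_BITS = 48
SUPPORTED_FABRICS = (1, 2, 3, 4, 6, 12)   # every count divides 48 evenly

def stripe_chunk(chunk: int, fabrics: int):
    """Split a 48-bit chunk into equal stripes and compute the spare's parity."""
    width = CHUNK_BITS // fabrics
    mask = (1 << width) - 1
    stripes = [(chunk >> (width * (fabrics - 1 - i))) & mask for i in range(fabrics)]
    parity = reduce(lambda x, y: x ^ y, stripes)
    return stripes, parity

def recover(stripes: list, parity: int, failed: int) -> int:
    """Recalculate the failed fabric's stripe from the good stripes and the parity."""
    good = [s for i, s in enumerate(stripes) if i != failed]
    return reduce(lambda x, y: x ^ y, good, parity)
```

With 3 fabrics each stripe is 16 bits wide; with 12 fabrics each stripe is 4 bits wide. In both cases XOR-ing the surviving stripes with the parity returns the missing one.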



TABLE 1: The switch ASICs

ASIC                Function

Striper             Takes incoming cell from Vortex (or OC192c equivalent) or from POS input stage and breaks the data up into the appropriate chunks to go to each fabric, calculates the parity for the spare fabric, concatenates a

Aggregator          Receives separate data and route word busses from multiple stripers. Converts from the reasonably slim dedicated striper->aggregator busses to a wide shared bus to the memory controllers.

Memory Controllers  Actually perform the queueing of data for the fabrics. Queues the cell into one of 200 queues (192 UC queues, 4 MC queues and 4 control port queues). All drops which occur in the chipset occur here.

Separator           Combines traffic from multiple memory controllers to one fabric output. Provides rate control of the stream of data leaving the fabric for each OC48 or OC192c port.

Unstriper           ... any fabric and attempts to reconstruct the good data. Passes the data to the output memory controller. If the striper is on an ATM blade and the data is a packet, it is segmented before passing onto the ATM controller.



Figure 1 shows packet striping in the switch.

The chipset supports ATM and POS port cards in both OC48
and OC192c configurations. OC48 port cards interface to the
switching fabrics with four separate OC48 flows. OC192 port cards
logically combine the 4 channels into a 10G stream. The ingress
side of a port card does not perform traffic conversions for
traffic changing between ATM cells and packets. Whichever form of
traffic is received is sent to the switch fabrics. The switch
fabrics will mix packets and cells and then dequeue a mix of
packets and cells to the egress side of a port card.

The egress side of the port is responsible for converting
the traffic to the appropriate format for the output port. This
convention is referred to in the context of the switch as "receiver
makes right". A cell blade is responsible for segmentation of
packets and a packet blade is responsible for reassembly of cells
into packets. To support fabric speed-up, the egress side of the
port card supports a link bandwidth equal to twice the inbound side
of the port card. For each OC48 interface, the unstriper supports
a bandwidth of 6 Gb/sec and for each OC192 interface, a bandwidth of
24 Gb/sec (combined routeword and data).

The block diagram for a Poseidon-based ATM port card is
shown in Figure 2. Each 2.5G channel consists of 4 ASICs: the Vortex
inbound TM ASIC and the striper ASIC at the inbound side, and the
unstriper ASIC and the Trident outbound TM ASIC at the outbound side.

At the inbound side, the Vortex ASIC aggregates 1 OC-48c
or 4 OC-12c interfaces. Each Vortex sends a 2.5G cell stream into a
dedicated striper ASIC (using the BIB bus, as described below). The
striper converts the Vortex-supplied routeword into two pieces. A
portion of the routeword is passed to the fabric to determine the
output port(s) for the cell. The entire routeword is also passed on
the data portion of the bus as a routeword for use by the outbound
memory controller. The first routeword is termed the "fabric
routeword". The routeword for the outbound memory controller is the
"egress routeword".

At the outbound side, the unstriper ASIC in each channel
takes traffic from each of the port cards, error checks and corrects
the data and then sends correct packets out on its output bus. The
unstriper uses the data from the spare fabric and the checksum
inserted by the striper to detect and correct data corruption.
5Gbps traffic is then sent to the Trident ASIC of the Poseidon
chipset. The Trident ASIC stores the incoming cells based on per-VC
queues and sends them out to OC-12c/OC-48c interfaces at an
aggregated speed of 2.5Gbps.

For the POS interfaces, the striper ASIC input bus speeds up to
3.2Gbps to handle POS overhead. On the outbound side, the unstriper
talks to a reassembly stage which is currently being defined.

Figure 2 shows an OC48 Port Card.

The OC192 port card supports a single 10G stream to the
fabric and between a 10G and 20G egress stream. This board also
uses 4 stripers and 4 unstripers, but the 4 chips operate in
parallel on a wider data bus. The data sent to each fabric is
identical for both OC48 and OC192 ports so data can flow between
the port types without needing special conversion functions.

Figure 3 shows a 10G concatenated network blade. 

Each 40G switch fabric enqueues up to 40Gbps of cells/frames
and dequeues them at 80Gbps. This 2X speed-up reduces the amount of
traffic buffered at the fabric and lets the outbound ASIC digest
bursts of traffic well above line rate. A switch fabric consists of
three kinds of ASICs: aggregators, memory controllers, and
separators. Nine aggregator ASICs receive 40Gbps of traffic from up
to 48 network blades and the control port. Each aggregator ASIC
combines the fabric route word and payload into a single data
stream, TDMs between its sources and places the resulting data on a
wide output bus. An additional control bus (destid) is used to
control how the memory controllers enqueue the data. The data stream
from each aggregator ASIC is then bit sliced into 12 memory
controllers.



The memory controller receives up to 16 cells/frames
every 250MHz clock cycle. Each of 12 ASICs stores 1/12 of the
aggregated data streams. It then stores the incoming data based on
control information received on the destid bus. Storage of data is
simplified in the memory controller to be relatively unaware of
packet boundaries (cache line concept). All 12 ASICs dequeue the
stored cells simultaneously at an aggregated speed of 80Gbps.



Nine separator ASICs perform the reverse function of the
aggregator ASICs. Each separator receives data from all 12 memory
controllers and decodes the routewords embedded in the data streams
by the aggregator to find packet boundaries. Each separator ASIC
then sends the data to up to 24 different unstripers depending on
the exact destination indicated by the memory controller as data
was being passed to the separator.



The dequeue process is back-pressure driven. If
back-pressure is applied to the unstriper, that back-pressure is
communicated back to the separator. The separator and memory
controllers also have a back-pressure mechanism which controls when
a memory controller can dequeue traffic to an output port.

In order to support OC48 and OC192 efficiently in the
chipset, the 4 OC48 ports from one port card are always routed to
the same aggregator and from the same separator (the port
connections for the aggregator and separator are always symmetric).
The table below shows the port connections for the aggregator and
separator on each fabric for the switch configurations. Since each
aggregator is accepting traffic from 10G of ports, the addition of
40G of switch capacity only adds ports to 4 aggregators. This leads
to a differing port connection pattern for the first four
aggregators from the second 4 (and also the corresponding
separators).



TABLE 2: Agg/Sep port connections

Switch Size  Agg 1        Agg 2        Agg 3        Agg 4        Agg 5        Agg 6        Agg 7        Agg 8
40           1,2,3,4      5,6,7,8      9,10,11,12   13,14,15,16
80           1,2,3,4      5,6,7,8      9,10,11,12   13,14,15,16  17,18,19,20  21,22,23,24  25,26,27,28  29,30,31,32
120          1,2,3,4,     5,6,7,8,     9,10,11,12,  13,14,15,16, 17,18,19,20  21,22,23,24  25,26,27,28  29,30,31,32
             33,34,35,36  37,38,39,40  41,42,43,44  45,46,47,48
160          1,2,3,4,     5,6,7,8,     9,10,11,12,  13,14,15,16, 17,18,19,20, 21,22,23,24, 25,26,27,28, 29,30,31,32,
             33,34,35,36  37,38,39,40  41,42,43,44  45,46,47,48  49,50,51,52  53,54,55,56  57,58,59,60  61,62,63,64
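
The pattern in the rows of Table 2 can be captured by a small mapping function. This is inferred only from the four rows shown (40G through 160G): each additional 40G contributes 16 ports spread over four aggregators, alternating between aggregators 1-4 and 5-8. The function name is hypothetical, and larger configurations are not covered.

```python
PORTS_PER_40G = 16        # 16 OC48 ports per 40G of capacity
PORTS_PER_AGG_GROUP = 4   # the 4 OC48 ports of one port card share an aggregator

def aggregator_for_port(port: int) -> int:
    """Return the aggregator (1-8) serving a port numbered 1-64, per Table 2."""
    if not 1 <= port <= 64:
        raise ValueError("Table 2 only lists ports 1-64")
    block = (port - 1) // PORTS_PER_40G                      # which 40G increment
    index = ((port - 1) % PORTS_PER_40G) // PORTS_PER_AGG_GROUP
    return index + 1 + 4 * (block % 2)                       # odd blocks use Agg 5-8
```

Ports 1-4 and 33-36 both land on aggregator 1, while ports 17-20 and 49-52 land on aggregator 5, matching the 120 and 160 rows.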



Figure 4 shows the connectivity of the fabric ASICs. 

The external interfaces of the switches are the Input Bus
(BIB) between the striper ASIC and the ingress blade ASIC such as
Vortex and the Output Bus (BOB) between the unstriper ASIC and the
egress blade ASIC such as Trident.

Two variations of routewords are supported. The first option uses
one 32 bit routeword which is passed to the egress board as the
egress routeword and has fields extracted to form the fabric
routeword. The second option allows the striper to accept both a
fabric routeword (which happens on a dedicated routeword bus) and
an egress routeword (which is received on the data bus). The second
option is more flexible on connection space usage and expansion
since it allows all 32 bits of the routeword to be used to identify
connections on switch egress.

To maintain compatibility with Vortex, bit 24 is still maintained
as the multicast bit. The incoming routeword has the following
format:

TABLE 3: 32 bit BIB/BOB route word format

bit 30:25    Connection ID (29:28) & Connection ID (19:16)
bit 24       Multicast Bit
bit 23:0     Connection ID (27:20) & Connection ID (15:0)
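
The field layout of the 32 bit route word (bits 30:25 carrying connection ID bits 29:28 and 19:16, bit 24 carrying the multicast bit, bits 23:0 carrying connection ID bits 27:20 and 15:0) can be illustrated with a pack/unpack pair. The sketch assumes that within each field the two connection ID pieces are concatenated in the order listed, high piece first; that ordering and the function names are assumptions.

```python
def pack_routeword(conn_id: int, multicast: int) -> int:
    """Build the route word from a 30-bit connection ID and the multicast bit."""
    return (
        ((conn_id >> 28) & 0x3) << 29      # connection ID (29:28) -> bits 30:29
        | ((conn_id >> 16) & 0xF) << 25    # connection ID (19:16) -> bits 28:25
        | (multicast & 0x1) << 24          # multicast bit         -> bit 24
        | ((conn_id >> 20) & 0xFF) << 16   # connection ID (27:20) -> bits 23:16
        | (conn_id & 0xFFFF)               # connection ID (15:0)  -> bits 15:0
    )

def unpack_routeword(routeword: int):
    """Recover (connection ID, multicast bit) from a route word."""
    conn_id = (
        ((routeword >> 29) & 0x3) << 28
        | ((routeword >> 16) & 0xFF) << 20
        | ((routeword >> 25) & 0xF) << 16
        | (routeword & 0xFFFF)
    )
    return conn_id, (routeword >> 24) & 0x1
```

Keeping the multicast bit at bit 24 means the connection ID is split around it, which is why the ID bits appear out of order in the word.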



The 20 bit conn ID in the routeword is set to

MC bit & Connection ID (29:5) for UC connections which are not
special routeword values

MC bit & Connection ID (24:0) for MC connections or for special
routeword unicast values.

For UC connections, although bits 29:5 are passed to the fabric,
only bits 29:20 are used. These bits should be programmed with the
queue to be used. Bits 29:28 should be programmed with the priority
and bits 27:20 programmed with the queue number.

Note that the RW value used for the outbound memory controller is
set to

0 & MC bit & connection ID (20:0).



If the fabric is using 10 bits of conn ID, this leaves 20 bits (1M
connections) for use by the outbound memory controller.



For double routewords, no manipulation is done. The value passed in
on the routeword bus needs to equal the connection ID to be
transmitted on the backplane. The following two tables show the
routeword value which should be passed on the backplane routeword
bus.



TABLE 4: Unicast Connection ID for separate RW bus

(bit position illegible)   Multicast bit = 0
bit 24:23                  Fabric priority
bit 22:15                  Fabric queue ID
(remaining bits)           Future expansion bits. These bits are transmitted to the fabric, but the current fabric ignores them. Future fabrics may expand to support these bits.



TABLE 5: Multicast Connection ID for separate RW bus

(bit position illegible)   Multicast bit = 1
bit 24:23                  Priority queue ID
bit 22:16                  Reserved. Note these bits are sent to the fabric to allow future fabrics to support more connection space.
bit 15:0                   Multicast connection ID (0 to 64K) used by the fabric



Special routewords are flagged by using reserved queue numbers
(those in the range of 240-255). These routeword values indicate
the receipt of an OAM cell which must get routed to the control
port or a queue resynch operation. These special values are always
expressed in terms of the connection ID which goes to the fabric.
If special routewords are given to the fabric, the memory
controller routeword must also be modified if these are getting
passed in using the separate connection number bus.



The routeword passed to the fabric will contain the multicast bit
and the port mask bits (bits 23:16). The routeword passed to the
outbound memory controller will maintain the port mask and also
contain the vortex ID and the port ID.

The connection ID of an OAM cell has a special format generated by
the Vortex ASIC:

TABLE 6: Connection ID for OAM cell

(bit position illegible)   Multicast bit = 0
bit 24:23                  Vortex ID (7:6)
bit 22:15                  0xF0 (hex)
bit 14:9 and below         Vortex ID (5:3), Reserved, Port ID (exact bit boundaries illegible)



The Vortex ID field is used to indicate which source Vortex ASIC
the cell comes from. The port ID indicates which port the cell
comes from inside the Vortex ASIC. Note that OAM cells are all
unicast. All OAM cells are destined to one of 196 blade and control
port queues programmed by an 8 bit OAM cell destination register in
the memory controller ASICs. If separate routeword busses are being
used, bits 24:16 of the BIB_CONN field will be passed to the
fabric. The routeword which appears on the data bus (memory
controller routeword) should include the port mask, vortex ID and
port ID fields in bits 23:0. The value in the multicast bit is a
don't care for the memory controller routeword.

Fabric queue ID 0xF0-0xF7 of the unicast connection ID is reserved
for software use. All packets which have the fabric queue ID in the
range of 0xF0-0xFF will be redirected to one of the 4 control port
queues based on a programmable register.

The connection ID of a resync cell has the following format. The
resync cell is used to resynchronize queues in the memory
controller ASICs. Fabric queue ID 0xF0-0xFF of the unicast
connection ID is reserved for special fabric functions.



TABLE 7: Connection ID for Resync cell

(bit position illegible)   Multicast bit = 0
(bit position illegible)   Priority (unused)
bit 22:15                  0xFF (hex)
bit 14:13                  Number of priorities per port
bit 12:0                   Reserved



The number of priority queues per port can only be changed during
the queue resync period, i.e., when a fabric is removed or
inserted, as follows:

00: one priority per port for 480G switch, pick bit 15 down to 8 of
the connection ID as the queue ID;

01: two priorities per port for 240G switch, pick bit 16 down to 9
of the connection ID as the queue ID;

10: 4 priorities per port for 120G or smaller switch, pick bit 17
down to 10 of the connection ID as the queue ID;

11: reserved






The resync cell can also be used to copy the shadow data register
to a valid location where the shadow address register points to.



A shadow control cell is used to copy the shadow data register to a
valid location where the shadow address register points to. The
connection ID of a shadow control cell uses the following format.



TABLE 8: Connection ID for Shadow Control Cell

(bit position illegible)   Multicast bit = 0
(bit position illegible)   Priority
bit 22:15                  0xFC (hex)
bit 14:0                   Reserved



Data coming into the BIB bus and out of the BOB bus is assumed to
be filled onto the busses from most significant bit to least
significant bit (highest number bit to lowest number bit).



The Striper ASIC accepts data from the ingress port via the Input
Bus (BIB) (also known as DIN_DT_bl_ch bus).



This bus can either operate as 4 separate 32 bit input buses
(4xOC48c) or a single 128 bit wide data bus with a common set of
control lines to all stripers. This bus supports either cells or
packets based on software configuration of the striper chip. It
consists of the following signals:



- BIB_Clock: This clock is sourced by the Striper ASIC at up to
100 MHz and is used as a reference for data and control signals
on the BIB.

- BIB_BP: This signal is asserted (low) to indicate the striper
ASIC cannot take data on the bus due to a bandwidth difference
between the busses. Interfaces which run below 93 MHz will never
see this signal asserted. At 100 MHz, this signal is asserted if
more than 65536 bytes of back to back data are given. This signal
should be sampled at the start of packet. During a packet
transfer, this signal will be asserted if the FIFO conditions
would cause BP if the packet ended on the current clock cycle. If
BP is asserted the clock cycle after the EOP, the striper will
effectively ignore the input bus until the BP indication is
withdrawn. The packet ingress stage should repeat the first word
of the next packet transfer and then proceed with the rest of the
packet after the BP signal goes away.



- BIB_Valid_L: This active low input signal delimits valid data on
the BIB_SOP, BIB_EOP, and BIB_DATA busses. If this signal is
active, the busses are assumed to be valid. If high, the busses
are treated as having invalid data for the current clock cycle.
If a transfer is not in progress (no SOP without EOP has been
given) then the data bus is treated as invalid even if this
signal is a one. For cell interfaces, this signal can be tied
active.



- BIB_Cell_Pkt: This signal is set to a one to indicate a cell
transfer and a zero to indicate a packet transfer. The signal
needs to be valid the same clock cycle as start of cell.

- BIB_Data[127:0]: This is the input 128 bit data bus. If running
in 32 bit mode, a cell consists of a 4 byte RW, a 4 byte Header,
and twelve 4 byte data words. A packet has a RW and N data words,
where 1 <= N. If running in 128 bit mode, a cell has a 4 byte RW,
a 4 byte header, and 8 bytes of data in the first word, 2 words
with 16 bytes of data, and a final word with 8 bytes of data, if
the data starts on a word boundary. A following cell can start on
the half word boundary and have all fields offset by 8 bytes.
Packets in 128 bit mode work in the same fashion as 32 bit mode,
except that EOP and SOP can have larger values. Minimum packet
length supported is 16 bytes. If half word boundary cell starts
are used, the correct value (0/4) needs to be given on the SOP
bits 3:0.



- BIB_EOP[4:0]: This bus has two fields. Bit 4 is a one to
indicate an EOP on the current transfer (if BIB_Valid_L is
active). Bit 4 is a zero to indicate no EOP on the current
transfer. Bits 3:0 give the offset of the last byte which is
valid. The EOP field is not utilized for cell transfers.



- BIB_SOP/C[1:0]: This bit indicates a start of packet or cell on
the current bus cycle (if BIB_Valid_L is active). A value of zero
indicates start of transfer, a value of one indicates no start of
transfer. Asserting bit 1 indicates that the upper 64 bits carry
the SOP and asserting bit 0 indicates that the lower 64 bits
carry the SOP (for 128 bit bus only). For the 32 bit bus, SOP(0)
should be used and SOP(1) should be tied high. For the 128 bit
bus, if a packet ends in the upper 64 bits of the bus, a new
packet can begin at bit 64.

- BIB_CONN(24:0): This is an optional bus. It can be used to pass
a routeword to the striper ASIC to use as the fabric routeword,
or the routeword can be transferred as the most significant 32
bits of the first word of data. The data should be valid the same
cycle as SOP/C. The value during non-SOP/C cycles is a don't
care. The interface is statically configured to either use the
separate connection number bus or to expect the routeword on the
data bus.
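
The cell framing on the input data bus can be checked with a few lines of arithmetic. The sketch below assumes a cell is a 4 byte RW, a 4 byte header and 48 bytes of payload (twelve 4 byte data words) starting on a word boundary, and simply slices that byte string into bus-width words; the function name is hypothetical and control signals are not modelled.

```python
RW_BYTES, HEADER_BYTES, PAYLOAD_BYTES = 4, 4, 48

def frame_cell(rw: bytes, header: bytes, payload: bytes, bus_bytes: int) -> list:
    """Slice one cell into bus words of bus_bytes each (4 for 32 bit, 16 for 128 bit)."""
    if (len(rw), len(header), len(payload)) != (RW_BYTES, HEADER_BYTES, PAYLOAD_BYTES):
        raise ValueError("unexpected cell field sizes")
    data = rw + header + payload
    return [data[i:i + bus_bytes] for i in range(0, len(data), bus_bytes)]
```

In 32 bit mode this gives 14 words (RW, header, twelve data words). In 128 bit mode it gives four words whose payload shares are 8, 16, 16 and 8 bytes, leaving the second half of the last word free for a following cell.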

Figure 5 shows a 32 bit BIB cell transfer.

Figure 6 shows a BIB back pressure.

Figure 7 shows a 32 bit BIB packet transfer using external
connection number bus.

The unstriper ASIC sends data to the egress port via the
Output Bus (BOB) (also known as DOUT_UN_bl_ch bus), which is a 64
(or 256) bit data bus that can support either cell or packet.

This bus can either operate as 4 separate 32 bit output
buses (4xOC48c) or a single 128 bit wide data bus with a common set
of control lines from all unstripers. This bus supports either
cells or packets based on software configuration of the unstriper
chip. It consists of the following signals:

- BOB_Clock: This clock is sourced from the unstriper ASIC at up
to 100 MHz and is used as a reference for data and control
signals on the BOB.

- BOB_BP: This active low input signal indicates whether data can
be transferred (inactive) or cannot be transferred (active). When
back pressure is asserted, the unstriper will stop advancing the
output bus and signal that data is not valid using the BOB_Valid
signal. Since synchronization must be done on both sides of the
interfaces, 8 clock cycles of data must be allowed from the
assertion of BP to data stopping. The source driving BOB_BP
cannot make any assumptions on the data stopping or restarting
except by examining BOB_Valid.

- BOB_Valid_L: This active low output signal indicates whether
the bus has valid data or not during a transfer. This signal
indicates invalid data only when BOB_BP has been asserted.

- BOB_Data: This is the output data bus. It can either be 64 bits
wide or 256 bits wide. If running in 64 bit mode, a cell consists
of a word with a 4 byte RW and a 4 byte Header followed by 6 data
words. A packet has a RW and N data words, where 1 <= N. If
running in 256 bit mode and a cell starts on an even 32 byte word
boundary, a cell has a word with a 4 byte RW, a 4 byte header and
24 bytes of data in the first word, and a second word with 24
bytes of data. A following cell can start on the next unused byte
and have all fields offset by 8 bytes. Valid cell start locations
are all multiples of 8 (0, 8, 16, 24). Packets in 128 bit mode
work in the same fashion as 32 bit mode, except that EOP and SOP
can have larger values. Minimum packet length supported is 16
bytes. If half word boundary cell starts are used, the correct
value (0/4) needs to be given on the SOP bits 3:0.

- BOB_EOP: This bit is asserted when the last transfer of a
packet is occurring.

- BOB_Cell_Pkt: This signal is set to a one to indicate a cell
transfer and a zero to indicate a packet transfer. The signal
needs to be valid the same clock cycle as start of cell.

- BOB_SOP/C: This bit is a zero to indicate a start of packet or
cell on the current bus cycle. Data is always assumed to start at
the most significant bit of the bus.

Figure 8 shows a 64 bit BOB cell transfer.

Figure 9 shows a 64 bit BOB packet transfer.

Figure 10 shows an overview of the datapath of the switch ASICs.

The data on the data bus transports an optional byte count (32 bit
word, lower 16 bits are the byte count) and a 32 bit egress
routeword. The unstriper core will always produce a byte count. If
a segmentation engine is used to break the packet up into cells,
then the segmentation engine will drop the byte count word before
it is given to the cell interface. This dropping is only supported
in OC48 mode. In OC192 mode, the chipset will have no provisions
for segmentation and dropping the byte count word.

TABLE 9: OC48 BOB format

OC48 Bits  OC192 Bits  Label       Usage
63:48      255:240     Unused      Reserved for unstriper use.
47:32      239:224     Byte count  Gives the count of the number of bytes in the packet, not counting the 4 bytes for the egress routeword and the bytes for the byte count (basically, this corresponds to the byte count of the received packet plus/minus any changes for reencapsulation, pushes, or pops).
31:0       223:192     Egress RW   Routeword for the egress memory controller.

The next bits start the data (bits 191 to 0 for OC192, the next
clock cycle for OC48).
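
As a minimal sketch of the OC48 layout (bits 63:48 unused, bits 47:32 byte count, bits 31:0 egress routeword), the first 64 bit output word can be decoded as follows. The function names are hypothetical and the bit positions are those read from the table above.

```python
def parse_first_word_oc48(word: int):
    """Split the first 64-bit output word into (byte count, egress routeword)."""
    byte_count = (word >> 32) & 0xFFFF      # bits 47:32
    egress_rw = word & 0xFFFFFFFF           # bits 31:0
    return byte_count, egress_rw

def build_first_word_oc48(byte_count: int, egress_rw: int) -> int:
    """Inverse of the parser; bits 63:48 are left at zero (unused)."""
    return ((byte_count & 0xFFFF) << 32) | (egress_rw & 0xFFFFFFFF)
```

A segmentation engine that drops the byte count word would discard the upper half of this word and pass only the routeword and the following data on to the cell interface.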

The Synchronizer has two main purposes. The first purpose is to
maintain logical cell/packet or datagram ordering across all
fabrics. On the fabric ingress interface, datagrams arriving at
more than one fabric from one port card's channels need to be
processed in the same order across all fabrics. The Synchronizer's
second purpose is to have a port card's egress channel re-assemble
all segments or stripes of a datagram that belong together even
though the datagram segments are being sent from more than one
fabric and can arrive at the blade's egress inputs at different
times. This mechanism needs to be maintained in a system that will
have different net delays and varying amounts of clock drift
between blades and fabrics.

The switch uses a system of synchronized windows where start
information is transmitted around the system. Each transmitter and
receiver can look at relative clock counts from the last resynch
indication to synchronize data from multiple sources. The receiver
will delay the receipt of data which is the first clock cycle of
data in a synch period until a programmable delay after it receives
the global synch indication. At this point, all data is considered
to have been received simultaneously and fixed ordering is applied.
Even though the delays for packet 0 and cell 0 caused them to be
seen at the receivers in different orders due to delays through the
box, the resulting ordering of both streams at receive time = 1 is
the same, Packet 0, Cell 0, based on the physical bus from which
they were received.
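
The fixed-ordering rule described above can be illustrated with a small sketch. It is not taken from the specification: the `Arrival` record, the bus numbering and the `release_clock` parameter are hypothetical names introduced only to show how two receivers that see the same data at different times can still arrive at the same order once the delayed sync is reached.

```python
from typing import NamedTuple


class Arrival(NamedTuple):
    bus: int    # physical bus index (hypothetical numbering)
    clock: int  # local clock count at which the data was actually seen
    name: str


def order_at_sync(arrivals, release_clock):
    """Treat everything seen by release_clock as simultaneous; order by bus."""
    eligible = [a for a in arrivals if a.clock <= release_clock]
    return [a.name for a in sorted(eligible, key=lambda a: a.bus)]


# Two receivers see Packet 0 and Cell 0 in opposite arrival orders.
receiver_x = [Arrival(bus=1, clock=1, name="Cell 0"),
              Arrival(bus=0, clock=3, name="Packet 0")]
receiver_y = [Arrival(bus=0, clock=2, name="Packet 0"),
              Arrival(bus=1, clock=4, name="Cell 0")]

order_x = order_at_sync(receiver_x, release_clock=5)
order_y = order_at_sync(receiver_y, release_clock=5)
```

Under these assumptions both receivers produce the order Packet 0, Cell 0, because the ordering key is the physical bus rather than the arrival time.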



Multiple cells or packets can be sent in one counter tick. All
destinations will order all cells from the first interface before
moving onto the second interface and so on. This cell
synchronization technique is used on all cell interfaces. Differing
resolutions are required on some interfaces.



The Synchronizer consists of two main blocks, namely, the
transmitter and receiver. The transmitter block will reside in the
Striper and Separator ASICs and the receiver block will reside in
the Aggregator and Unstriper ASICs. The receiver in the Aggregator
will handle up to 24 (6 port cards x 4 channels) input lanes. The
receiver in the Unstriper will handle up to 13 (12 fabrics + 1
parity fabric) input lanes.

When a sync pulse is received, the transmitter first calculates the
number of clock cycles it is fast (denoted as N clocks).

The transmit synchronizer will interrupt the output stream and
transmit N K characters indicating it is locking down. At the end
of the lockdown sequence, the transmitter transmits a K character
indicating that valid data will start on the next clock cycle. This
next cycle valid indication is used by the receivers to synchronize
traffic from all sources. Refer to "K character usage" on page 34
for the mapping of K characters to the functions.

At the next end of transfer, the transmitter will then 
insert at least one idle on the interface. These idles allow the 
10 bit decoders to correctly resynchronize to the 10 bit serial 
code window if they fall out of synch. 

The receive synchronizer receives the global synch pulse and delays
the synch pulse by a programmed number (which is programmed based
on the maximum amount of transport delay a physical box can have).
After delaying the synch pulse, the receiver will then consider the
clock cycle immediately after the synch character to be eligible to
be received. Data is then received every clock cycle until the next
synch character is seen on the input stream. This data is not
considered to be eligible for receipt until the delayed global
synch pulse is seen.

Since transmitters and receivers will be on different physical
boards and clocked by different oscillators, clock speed
differences will exist between them. To bound the number of clock
cycles between different transmitters and receivers, a global sync
pulse is used at the system level to resynchronize all sequence
counters. Each chip is programmed to ensure that under all valid
clock skews, each transmitter and receiver will think that it is
fast by at least one clock cycle. Each chip then waits for the
number of clock cycles it is into its current sync_pulse_window.
This ensures that all sources run N * sync_pulse_window valid clock
cycles between synch pulses.

As an example, the synch pulse window could be programmed to 100
clocks, and the synch pulses sent out at a nominal rate of a synch
pulse every 10,000 clocks. Based on worst case drifts for both the
synch pulse transmitter clocks and the synch pulse receiver clocks,
there may actually be 9,995 to 10,005 clocks at the receiver for
10,000 clocks on the synch pulse transmitter. In this case, the
synch pulse transmitter would be programmed to send out synch
pulses every 10,006 clock cycles. The 10,006 clocks guarantee that
all receivers must be in their next window. A receiver with a fast
clock may have actually seen 10,012 clocks if the synch pulse
transmitter has a slow clock. Since the synch pulse was received 12
clock cycles into the synch pulse window, the chip would delay for
12 clock cycles. Another receiver could see 10,006 clocks and lock
down for 6 clock cycles at the end of the synch pulse window. In
both cases, each source ran 10,100 clock cycles.
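
The arithmetic in this example can be sketched as follows. This is an illustration only, assuming that the lockdown count is simply the position of the local clock count within the current synch pulse window; the function name `lockdown_cycles` is hypothetical and does not appear in the specification.

```python
def lockdown_cycles(clocks_since_last_sync: int, window: int) -> int:
    """Number of clocks the chip is into its current synch pulse window.

    Assumption: the chip counts local clocks from the previous synch
    pulse and locks down for the remainder modulo the window size.
    """
    return clocks_since_last_sync % window


WINDOW = 100  # synch pulse window from the example above

fast_receiver = lockdown_cycles(10_012, WINDOW)   # receiver with a fast clock
other_receiver = lockdown_cycles(10_006, WINDOW)  # receiver matching the transmitter
```

With the figures used in the example, the fast receiver locks down for 12 cycles and the other receiver for 6 cycles.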

When a port card or fabric is not present or has just been inserted
and either of them is supposed to be driving the inputs of a
receive synchronizer, the writing of data to the particular input
FIFO will be inhibited since the input clock will be absent or
unstable and the status of the data lines will be unknown. When the
port card or fabric is inserted, software must come in and enable
the input to the byte lane to allow data from that source to be
enabled. Writes to the input FIFO will be enabled. It is assumed
that the enable signal will be asserted after the data, routeword
and clock from the port card or fabric are stable.

At a system level, there will be a primary and secondary sync pulse
transmitter residing on two separate fabrics. There will also be a
sync pulse receiver on each fabric and blade. This can be seen in
Figure [[11]] 5. The primary sync pulse transmitter will be a
free-running sync pulse generator and the secondary sync pulse
transmitter will synchronize its sync pulse to the primary. The
sync pulse receivers will receive both primary and secondary sync
pulses and, based on an error checking algorithm, will select the
correct sync pulse to forward on to the ASICs residing on that
board. The sync pulse receiver will guarantee that a sync pulse is
only forwarded to the rest of the board if the sync pulse from the
sync pulse transmitters falls within its own sequence "0" count.
For example, the sync pulse receiver and an Unstriper ASIC will
both reside on the same Blade. The sync pulse receiver and the
receive synchronizer in the Unstriper will be clocked from the same
crystal oscillator, so no clock drift should be present between the
clocks used to increment the internal sequence counters. The
receive synchronizer will require that the sync pulse it receives
will always reside in the "0" count window.

If the sync pulse receiver determines that the primary sync pulse
transmitter is out of sync, it will switch over to the secondary
sync pulse transmitter source. The secondary sync pulse transmitter
will also determine that the primary sync pulse transmitter is out
of sync and will start generating its own sync pulse independently
of the primary sync pulse transmitter. This is the secondary sync
pulse transmitter's primary mode of operation. If the sync pulse
receiver determines that the primary sync pulse transmitter is in
sync once again, it will switch to the primary side. The secondary
sync pulse transmitter will also determine that the primary sync
pulse transmitter is in sync once again and will switch back to a
secondary mode. In the secondary mode, it will sync up its own sync
pulse to the primary sync pulse. The sync pulse receiver will have
less tolerance in its sync pulse filtering mechanism than the
secondary sync pulse transmitter. The sync pulse receiver will
therefore switch over more quickly than the secondary sync pulse
transmitter. This is done to ensure that all receiver synchronizers
will have switched over to using the secondary sync pulse
transmitter source before the secondary sync pulse transmitter
switches over to a primary mode.

Figure [[11]] 5 shows sync pulse distribution.



In order to lock down the backplane transmission from a fabric by
the number of clock cycles indicated in the sync calculation, the
entire fabric must effectively freeze for that many clock cycles to
ensure that the same enqueuing and dequeueing decisions stay in
sync. This requires support in each of the fabric ASICs. Lockdown
stops all functionality, including special functions like queue
resynch.

The sync signal from the synch pulse receiver is distributed to all
ASICs. Each fabric ASIC contains a counter in the core clock domain
that counts clock cycles between global sync pulses. After the sync
pulse is received, each ASIC calculates the number of clock cycles
it is fast. Because the global sync is not transferred with its own
clock, the calculated lockdown cycle value may not be the same for
all ASICs on the same fabric. This difference is accounted for by
keeping all interface FIFOs at a depth where they can tolerate the
maximum skew of lockdown counts.

Lockdown cycles on all chips are always inserted at the same
logical point relative to the beginning of the last sequence of
"useful" (non-lockdown) cycles. That is, every chip will always
execute the same number of "useful" cycles between lockdown events,
even though the number of lockdown cycles varies.



Lockdown may occur at different times on different chips. All
fabric input FIFOs are initially set up such that lockdown can
occur on either side of the FIFO first without the FIFO running dry
or overflowing. On each chip-chip interface, there is a sync FIFO
to account for lockdown cycles (as well as board trace lengths and
clock skews). The transmitter signals lockdown while it is locked
down. The receiver does not push during indicated cycles, and does
not pop during its own lockdown. The FIFO depth will vary depending
on which chip locks down first, but the variation is bounded by the
maximum number of lockdown cycles. The number of lockdown cycles a
particular chip sees during one global sync period may vary, but
they will all have the same number of useful cycles. The total
number of lockdown cycles each chip on a particular fabric sees
will be the same, within a bounded tolerance.
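
The bounded variation in sync FIFO depth can be illustrated with a small cycle-by-cycle model. This is only a sketch under simplifying assumptions that are not stated in the specification: both chips lock down for the same number of cycles, one word is pushed and one popped per useful cycle, and the function and parameter names are hypothetical.

```python
def simulate_sync_fifo(initial_depth, tx_lock_start, rx_lock_start,
                       lock_cycles, total_cycles):
    """Return the FIFO depth after every cycle of a simple push/pop model."""
    depth = initial_depth
    depths = []
    for cycle in range(total_cycles):
        tx_locked = tx_lock_start <= cycle < tx_lock_start + lock_cycles
        rx_locked = rx_lock_start <= cycle < rx_lock_start + lock_cycles
        if not tx_locked:
            depth += 1  # no push while the transmitter signals lockdown
        if not rx_locked:
            depth -= 1  # no pop during the receiver's own lockdown
        depths.append(depth)
    return depths


# Transmitter locks down first: the FIFO gets shallower, then recovers.
tx_first = simulate_sync_fifo(4, tx_lock_start=10, rx_lock_start=12,
                              lock_cycles=4, total_cycles=30)
# Receiver locks down first: the FIFO gets deeper, then recovers.
rx_first = simulate_sync_fifo(4, tx_lock_start=12, rx_lock_start=10,
                              lock_cycles=4, total_cycles=30)
```

In this model the depth never moves more than `lock_cycles` away from its initial value and returns to that value once both lockdowns have completed, which is why the initial depth has to leave room in both directions.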

The Aggregator core clock domain completely stops for the lockdown
duration - all flops and memory hold their state. Input FIFOs are
allowed to build up. Lockdown bus cycles are inserted in the output
queues. Exactly when the core lockdown is executed is dictated by
when DOUT_AG bus protocol allows lockdown cycles to be inserted.
DOUT_AG lockdown cycles are indicated on the DestID bus.

The memory controller must lock down all flops for the appropriate
number of cycles. To reduce impact to the silicon area in the
memory controller, a technique called propagated lockdown is used.

The aggregator signals lockdown cycles on the DIN_ME bus. The
memory controller does not push during these cycles. The memory
controller does not pop during lockdown to account for the non-push
cycles. The FIFO depth is set during fabric synchronization to
tolerate getting deeper or shallower depending on who locks down
first.

Lockdown idle cycles are inserted on the DOUT and CII_ID busses. An
extended sync signal is used to indicate the number of lockdown
cycles on the DOUT_ME bus to aid the Separator's lockdown function.

The token bus lockdown looks the same as the DIN_ME bus from a
memory controller perspective. Non-push cycles are signaled by the
separators according to their lockdowns. The memory controller does
not pop during lockdown. The Separator locks down completely in a
manner similar to the Aggregator. DIN_SP and CII_ID lockdown cycles
are signaled individually per bus via the SYNC signals. Any
continuous SYNC assertion after the first one is considered a
lockdown cycle. Lockdown bus cycles are not pushed into the input
FIFOs.



The chip-to-chip communication within a single fabric must be
synchronized. Although no clock drift exists between chips,
differences in track delays cause data to arrive at different
Memory Controllers at different times. All Memory Controllers need
to process incoming packets in exactly the same logical order on
each chip. The Separators must align and combine multiple data
slices coming from different Memory Controllers. The Memory
Controllers must take the tokens received from the Separators and
apply them at exactly the same point in the logical packet flow, or
drop decisions may differ from chip to chip.



The on-fabric chip-to-chip synchronization is executed at every
sync pulse. While some sync error detecting capability may exist in
some of the ASICs, it is the Unstriper's job to detect fabric
synchronization errors and to remove the offending fabric. The
chip-to-chip synchronization is a cascaded function that is done
before any packet flow is enabled on the fabric. The
synchronization flows from the Aggregator to the Memory Controller,
to the Separator, and back to the Memory Controller. After the
system reset, the Aggregators wait for the first global sync
signal. When received, each Aggregator transmits a local sync
command (value 0x2) on the DestID bus to each Memory Controller.

The Memory Controllers do not push anything into a DIN input FIFO
until the first sync command is seen on that bus. The sync and
every bus cycle following is constantly pushed into the input FIFO.
On the core side of the input FIFOs, no FIFO is popped until a sync
appears in the FIFO from every Aggregator. After two additional
margin cycles, every input FIFO is popped every cycle. After this
point the input FIFO depths remain constant. The depths are roughly
a function of the track delays from each Aggregator. Immediately
after the Memory Controllers begin sampling the Aggregator input
FIFOs, a sync signal (G_GYNC_L) is transmitted to all Separators on
the DOUT and CII_ID busses.
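
The input FIFO alignment described above can be modeled in a few lines. The sketch below is illustrative only: it assumes one word per cycle on every bus, expresses track delays as whole clock cycles, and uses hypothetical names (`align_input_fifos`, `track_delays`, `margin`) that do not come from the specification.

```python
from collections import deque

SYNC = "SYNC"


def align_input_fifos(track_delays, cycles, margin=2):
    """Model per-Aggregator input FIFOs that start popping together.

    Each bus delivers SYNC at cycle == its track delay, then logical
    words 1, 2, 3, ... on the following cycles. Nothing is pushed
    before the sync, and nothing is popped until every FIFO holds a
    sync and `margin` further cycles have passed.
    """
    fifos = [deque() for _ in track_delays]
    pop_from = None
    popped = []
    for cycle in range(cycles):
        for fifo, delay in zip(fifos, track_delays):
            if cycle == delay:
                fifo.append(SYNC)
            elif cycle > delay:
                fifo.append(cycle - delay)
        if pop_from is None and all(SYNC in fifo for fifo in fifos):
            pop_from = cycle + margin
        if pop_from is not None and cycle >= pop_from:
            popped.append(tuple(fifo.popleft() for fifo in fifos))
    return popped, [len(fifo) for fifo in fifos]


popped, depths = align_input_fifos(track_delays=[0, 3, 1], cycles=12)
```

Every popped tuple carries the same logical word from each bus, and the residual depths differ per bus in line with the track delays, with the shortest delay leaving the deepest FIFO.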

Like the Memory Controllers, the Separators do not push into the
DIN and CII_ID busses until a sync signal is received on that bus.
The sync and everything after is constantly pushed into the input
FIFO.

On the core side the Separator always waits until at least one word
is present on all input busses, and then pops the CII_ID and DIN
busses simultaneously. This will logically align the data stripes
coming from the Memory Controllers. After the first combined sync
is popped from the input FIFOs, the Separators send a sync signal
on the TOKEN bus to the Memory Controllers.

The Memory Controllers do not push into the TOKEN bus input FIFO
until a sync signal (0x3F on the token bus) has been seen on the
bus. The sync and all subsequent tokens and idles are always
pushed.

All Memory Controllers need to apply the received tokens to the
same point in the incoming logical flow in order for all drop
decisions to be identical. This is done by waiting a worst case
number of clock cycles after the Separator sync transmission before
beginning to pop the token input FIFO. The worst case delay must be
used because there is no way for a single Memory Controller to know
exactly when all other Memory Controllers have received a token.
The programmable delay stored in the 16 bit Token Sync Wait
Register is in "useful" cycles (125MHz) that do not include the
fabric lockdown cycles. The worst case delay is the worst case skew
for all data paths going from the Aggregator to Memory Controller
to Separator and back to Memory Controller.

The following Table 11 gives the min/max delays which the chipset
supports and represents the limits of what is verified in the chip
verification process.

Sync pulse transport delay from Transmitter to any individual chip
receiving the sync pulse (WC path - BC path): 500 nS (min delay of
0, max delay of 500 nS). At 175 ps/inch, this works out to a
difference of about 70 m. Backplane transport delay difference from
local sync pulse receipt to reception of the sync indication flag
by the far end chips: 500 nS. Note that it is desired to allot
about 25 nS of this to the chip synchronizer operation which gives
a delta path delay supported of 500 nS.

Oscillators should be HH3 ppm oscillators. The assumption of the
design was that the difference in transmission path delay was less
than or equal to clock drift. On board delays between chips have
been designed to exceed the following specs:

Shortest net: 0.25", transport delay of pretty much 0.
Longest net: 25", transport delay is 5 nS,

for any signal distribution. The net delta delay between chips is a
multiplier of the number of busses the sync has traversed. Since
the sync goes through a receive synchronization to the local clock
of the chip, an 8 nS uncertainty has to be added at each stage
giving a net uncertainty of around 21 nS for each hop.

TABLE 11: Fabric sync delay

Chip                 Number of   Skew    Notes
                     busses
Agg                  1           21 nS   Sync pulse in
Memory controller    2           42 nS   Sync pulse to agg + agg_me delta
Sep DIN              3                   Sync pulse to agg + agg_mc + me_sep
                                         (note this sync pulse is delayed by
                                         the memory controller for
                                         propagated lockdown).
Memory controller    4           84 nS   Everything above + sep_me tokens.
token in

The control port follows the same cell flow as the regular ports.
The switch control processor sends cells to the striper ASIC; the
striper stripes the cells and route words across all fabrics. An
additional aggregator (9th) ASIC sends cells via the DOUT_AG/DestID
buses to all 12 memory controllers. Each memory controller ASIC has
an additional 9th DIN_ME_fb_se_9 bus.

The memory controller ASIC will route the incoming control port
cells to any one of the control port destination queues and blade
queues (up to 190 queues). The 9th BOUT ME fb se 9 bus is used to
send the control cells to the 9th separator ASIC, which sends the
cells to one of several destination unstriper ASICs. The unstriper
ASIC reconstructs the cells from all 9th separator ASICs across all
fabrics. It sends the complete control cells to the switch control
processor it is connected to.

Note that the control port destination queues can be part of any
multicast cells such that the multicast port mask is necessary to
include additional bit(s) to indicate the control port queue(s).

There are at most 4 control ports in any switch configuration. This
limitation is due to the aggregator and separator ASICs having only
4 12 bit channels, which can be scaled to different switch
configurations, respectively. In other words, bus DIN_AG_fb_9_1_1,
DIN_AG_fb_9_2_1, DIN_AG_fb_9_3_1, and DIN_AG_fb_9_4_1 of the
aggregator ASIC are connected to up to 4 control port striper
ASICs. Bus DOUT_SP_fb_9_1_1, DOUT_SP_fb_9_2_1, DOUT_SP_fb_9_3_1,
and DOUT_SP_fb_9_4_1 of the separator ASIC are connected to up to 4
control port unstriper ASICs.

The striping function assigns bits from incoming data streams to
individual fabrics. Two items were optimized in deriving the
striping assignment:

1. Backplane efficiency should be optimized for OC48 and OC192.

2. Backplane interconnection should not be significantly altered
for OC192 operation.

These were traded off against additional muxing legs for the
striper and unstriper ASICs. Regardless of the optimization, the
switch must have the same data format in the memory controller for
both OC48 and OC192.

Backplane efficiency requires that minimal padding be added when
forming the backplane busses. Given the 12 bit backplane bus for
OC48 and the 48 bit backplane bus for OC192, an optimal assignment
requires the number of unused bits for a transfer to be equal to
(number_of_bytes * 8) / bus_width where "/" is integer division.
For OC48, the bus can have 0, 4 or 8 unutilized bits. For OC192 the
bus can have 0, 8, 16, 24, 32, or 40 unutilized bits.

This means that no bit can shift between 12 bit boundaries or else
OC48 padding will not be optimal for certain packet lengths.
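
The unutilized-bit counts listed above can be reproduced with a short calculation. The sketch below reads the requirement as padding a transfer up to the next whole bus word, which is an interpretation rather than the formula as printed; the function name `unused_bits` is hypothetical.

```python
def unused_bits(number_of_bytes: int, bus_width: int) -> int:
    """Padding bits needed to round a transfer up to whole bus words.

    Assumption: a transfer of number_of_bytes bytes occupies the
    smallest whole number of bus_width-bit words that can hold it.
    """
    return (-(number_of_bytes * 8)) % bus_width


# Collect every padding value that occurs over a range of packet lengths.
oc48_padding = sorted({unused_bits(n, 12) for n in range(1, 200)})
oc192_padding = sorted({unused_bits(n, 48) for n in range(1, 200)})
```

Under this reading, the 12 bit bus yields paddings of 0, 4 or 8 bits and the 48 bit bus yields 0, 8, 16, 24, 32 or 40 bits, matching the values given in the text.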

For OC192c, maximum bandwidth utilization means that each striper
must receive the same number of bits (which implies bit
interleaving into the stripers). When combined with the same
backplane interconnection, this implies that in OC192c, each stripe
must have exactly the correct number of bits come from each striper
which has 1/4 of the bits.

For the purpose of assigning data bits to fabrics, a 48 bit frame
is used. Inside the striper is a FIFO which is written 32 bits wide
at 80-100 MHz and read 24 bits wide at 125 MHz. Three 32 bit words
will yield four 24 bit words. Each pair of 24 bit words is treated
as a 48 bit frame. The assignments between bits and fabrics depend
on the number of fabrics.
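
The width conversion in the striper FIFO can be illustrated with a small regrouping helper. This is a sketch only: it assumes the most significant bits are carried first, which the text does not state, and the function name `regroup` is hypothetical.

```python
def regroup(words, in_width, out_width):
    """Repack a list of in_width-bit words into out_width-bit words.

    Assumption: bits are concatenated most significant word first.
    """
    total_bits = in_width * len(words)
    if total_bits % out_width:
        raise ValueError("input does not fill a whole number of output words")
    value = 0
    for word in words:
        value = (value << in_width) | (word & ((1 << in_width) - 1))
    count = total_bits // out_width
    mask = (1 << out_width) - 1
    return [(value >> (out_width * (count - 1 - i))) & mask
            for i in range(count)]


written = [0x00112233, 0x44556677, 0x8899AABB]  # three 32 bit writes
read = regroup(written, 32, 24)                 # four 24 bit reads
frames = regroup(read, 24, 48)                  # two 48 bit frames
```

Three 32 bit writes carry 96 bits, which is exactly four 24 bit reads or two 48 bit frames, so no bits are left over at the frame boundary.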



TABLE 12: Bit striping function 

The following tables give the byte lanes which are read first in
the aggregator and written to first in the separator. The four
channels are notated A,B,C,D. The different fabrics have different
read/write order of the channels to allow for all busses to be
fully utilized.

One fabric-40G 



The next table gives the interface read order for the 
aggregator . 
120G

Fabric   1st   2nd   3rd   4th
         C     A     D     B
2        B     C     A     D
Par      A     D     B     C


Three fabric-160G

Fabric   1st   2nd   3rd   4th
0        A     B     C     D
1        D     A     B     C
2        C     D     A     B
3        B     C     D     A
Par      A     B     C     D


Six fabric-240G

Fabric   1st   2nd   3rd   4th
0        A     D     C     B
1        B     A     D     C
2        B     A     D     C
3        C     B     A     D
4        D     C     B     A
5        D     C     B     A
Par      A     C     D     B


Twelve fabric-480G

Fabric    1st   2nd   3rd   4th
0,1,2     A     D     C     B
3,4,5     B     A     D     C
6,7,8     C     B     A     D
9,10,11   D     C     B     A
Par       A     B     C     D


Interfaces to the gigabit transceivers will utilize the transceiver
bus as a split bus with two separate routeword and data busses. The
routeword bus will be a fixed size (2 bits for OC48 ingress, 4 bits
for OC48 egress, 8 bits for OC192 ingress and 16 bits for OC192
egress); the data bus is a variable sized bus. The transmit order
will always have routeword bits at fixed locations. Every striping
configuration has one transceiver that is used to talk to a
destination in all valid configurations. That transceiver will be
used to send both routeword busses and to start sending the data.

The backplane interface is physically implemented using 125 MHz
interfaces to the backplane transceivers. The 125 MHz bus for both
ingress and egress is viewed as being composed of two halves, each
with routeword data. The two bus halves may have information on
separate packets if the first bus half ends a packet.

For example, an OC48 interface going to the fabrics locally
speaking has 24 data bits and 2 routeword bits @ 125 MHz. This bus
will be utilized acting as if it has 2 x (12 bit data bus + 1 bit
routeword bus). The two bus halves are referred to as A and B. Bus
A is the first data, followed by bus B. A packet can start on
either bus A or B and end on either bus A or B.

In mapping data bits and routeword bits to transceiver bits, the
bus bits are interleaved. This ensures that all transceivers should
have the same valid/invalid status, even if the striping amount
changes. Routewords should be interpreted with bus A appearing
before bus B.

The bus A/Bus B concept closely corresponds to having 250 MHz
interfaces between chips.

All backplane busses support fragmentation of data. The protocol
used marks the last transfer (via the final segment bit in the
routeword). All transfers which are not final segment need to
utilize the entire bus width, even if that is not an even number of
bytes. Any given packet must be striped to the same number of
fabrics for all transfers of that packet. If the striping amount is
updated in the striper during transmission of a packet, it will
only update the striping at the beginning of the next packet.

Each transmitter on the ASICs will have the following I/O for each
channel:

8 bit data bus, 1 bit clock, 1 bit control.

On the receive side, for each channel the ASIC receives

a receive clock, 8 bit data bus, 3 bit status bus.

The switch optimizes the transceivers by mapping a transmitter to
between 1 and 3 backplane pairs and each receiver with between 1
and 3 backplane pairs. This allows only enough transmitters to
support the traffic needed in a configuration to be populated on
the board while maintaining a complete set of backplane nets. The
motivation for this optimization was to reduce the number of
transceivers needed.

The optimization was done while still requiring that at any time,
two different striping amounts must be supported in the gigabit
transceivers. This allows traffic to be enqueued from a striper
striping data to one fabric and a striper striping data to two
fabrics at the same time.



In all modes of operation, the entire 3.0G of data is always
supported on switch ingress. For egress operation, for 40G and 80G,
the number of transceivers needed to support a full speedup were
deemed too expensive. For these switch modes, the output speedup is
between 1.5 and 2. All configurations above 80G support a full 2x
speedup.



Depending on the bus configuration, multiple channels may need to
be concatenated together to form one larger bandwidth pipe (any
time there is more than one transceiver in a logical connection).
Although quad gbit transceivers can tie 4 channels together, this
functionality is not used. Instead the receiving ASIC is
responsible for synchronizing between the channels from one source.
This is done in the same context as the generic synchronization
algorithm.



The 8b/10b encoding/decoding in the gigabit transceivers allows a
number of control events to be sent over the channel. The notation
for these control events is K characters and they are numbered
based on the encoded 10 bit value. Several of these K characters
are used in the chipset. The K characters used and their functions
are given in the table below.

TABLE 11: K Character usage

K character   Function          Notes
28.0          Sync indication   Transmitted after lockdown cycles, treated
                                as the prime synchronization event at the
                                receivers
28.1          Lockdown          Transmitted during lockdown cycles on the
                                backplane
28.2          Packet Abort      Transmitted to indicate the card is unable
                                to finish the current packet. Current use is
                                limited to a port card being pulled while
                                transmitting traffic
28.3*         Resync window     Transmitted by the striper at the start of a
                                synch window if a resynch will be contained
                                in the current sync window
28.4          BP set            Transmitted by the striper if the bus is
                                currently idle and the value of the bp bit
                                must be set.
28.5          Idle              Indicates idle condition
28.6          BP clr            Transmitted by the striper if the bus is
                                currently idle and the bp bit must be
                                cleared.
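
The way the lockdown and sync K characters frame valid data can be sketched in code. The example below is illustrative only: the `KChar` enumeration simply mirrors the table, while the helper names `lockdown_sequence` and `first_valid_index` and the list-based stream are hypothetical and not part of the chipset.

```python
from enum import Enum


class KChar(Enum):
    """K characters from the usage table, keyed by function."""
    SYNC = "28.0"
    LOCKDOWN = "28.1"
    PACKET_ABORT = "28.2"
    RESYNC_WINDOW = "28.3"
    BP_SET = "28.4"
    IDLE = "28.5"
    BP_CLR = "28.6"


def lockdown_sequence(n_fast):
    """N lockdown characters followed by one sync indication."""
    return [KChar.LOCKDOWN] * n_fast + [KChar.SYNC]


def first_valid_index(stream):
    """Index of the first cycle after the sync indication."""
    return stream.index(KChar.SYNC) + 1


# A transmitter that is 3 clocks fast: idle, lockdown, sync, then data.
stream = [KChar.IDLE] + lockdown_sequence(3) + ["data0", "data1"]
start = first_valid_index(stream)
```

In this model the receiver ignores everything up to and including the sync indication and treats the following cycle as the first valid data of the window.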

The switch has a variable number of data bits supported to each
backplane channel depending on the striping configuration for a
packet. Within a set of transceivers, data is filled in the
following order:

F[fabric]_[oc192 port number][oc48 port designation
(a,b,c,d)][transceiver_number]

Everything in the documentation is done for fabric-1, which is the
case where all connections are needed. The only part of this which
is used for fill order is transceiver_number (OC48) and transceiver
number and oc48 port designation for OC192.

The fundamental rules for mapping are the following:

1. BP and RW are on transceiver 1. These always occupy the first 4
bits of the transceiver.

2. Data bits starting with the least significant bit are filled
into the data bus in a 2 bit bit interleaved pattern, with bus A
and bus B pairs.

3. Transceivers are filled in starting at bit 0 of their transmit
and receive interfaces.

4. All multibit routeword fields are transmitted LSB to MSB. This
includes connection number, number of fabrics and encoded values of
stop/align/final segment. The overall routeword is notated as
starting from bit 0 (least significant bit) and up. Transmit order
is Bit 0 (OOP) goes on the first routeword bit, followed by bit 1
(Packet type). If multiple routeword bits are transmitted in the
same clock they are filled in starting with the first bit going to
bit 0, the second bit going to bit 1.

5. Data should be encoded and decoded based on a bus A/Bus B order.

6. For OC192, the fill order should be bus A, B, C, D for routeword
bits. For data bits, the fill order depends on
wacking/unwacking/reverse unwacking and reverse wacking functions.

Transceiver 1

For an ingress bus, the format of data is the following:

Bit 0      BP
Bit 1
Bit 2, 3   RWA, RWB
Bit 4      Dataa(0)
Bit 5      Dataa(1)
Bit 6, 7   Datab(0), Datab(1)

Note that for 12 fabric mode, bits 5 and 7 are unused.

The location of datab(0) does not change.



For the egress bus, the format of the data is the following:

Bit 0      RWA(0)
Bit 1      RWA(1)
Bit 2      RWB(0)
Bit 3      RWB(1)
Bit 4      Dataa(0)
Bit 5      Dataa(1)
Bit 6, 7   Datab(0), Datab(1)



Transceiver 2 and up

Fill up the data bus starting at each transceiver bit 0 to bit 7
with 2 bit interleaved dataa/datab patterns.

For example, transceiver 2 has the following pattern:

Bit 0      dataa(2)
Bit 1      dataa(3)
Bit 2      datab(2)
Bit 3      datab(3)
Bit 4      dataa(4)
Bit 5      dataa(5)
Bit 6, 7   datab(4), datab(5)
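
The 2 bit interleaved fill for transceiver 2 and up can be sketched as follows. This is an illustrative sketch only: the function name fill_pattern is hypothetical, and the continuation to transceiver 3 and beyond (four more bits per bus on each transceiver) is an assumption extrapolated from the transceiver 2 example above.

```python
def fill_pattern(transceiver: int, bits_per_transceiver: int = 8) -> list[str]:
    """Return the labels carried on bits 0..7 of a data-only transceiver.

    Assumption: transceiver 1 carries dataa/datab bits 0..1, so
    transceiver 2 starts at data bit 2 and each later transceiver
    continues where the previous one stopped.
    """
    if transceiver < 2:
        raise ValueError("transceiver 1 also carries BP and routeword bits")
    if bits_per_transceiver % 4:
        raise ValueError("need whole groups of 2 dataa bits + 2 datab bits")
    per_bus = bits_per_transceiver // 2      # data bits each bus contributes
    base = 2 + (transceiver - 2) * per_bus   # first data bit index on this transceiver
    labels = []
    for pair in range(0, per_bus, 2):
        first = base + pair
        labels += [f"dataa({first})", f"dataa({first + 1})"]  # two bits of bus A
        labels += [f"datab({first})", f"datab({first + 1})"]  # then two bits of bus B
    return labels


if __name__ == "__main__":
    for bit, label in enumerate(fill_pattern(2)):
        print(f"Bit {bit}: {label}")
```

Calling fill_pattern(2) reproduces the transceiver 2 pattern listed above.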



The stop/align encoding depends on the width of the bus interface. 



TABLE 12: OC48 portcard to fabric routeword stop/align



r?:_i -i 

■ ' ItlU 


Length 




TO 


2 I n (where 
n is the 








Eloek cyelcs 


stop bit of zero is seen, followed by the align bit and FS. Since
stop is followed by the align and FS bits, the stop bit is given 2
clock cycles before the end of data.

Align bit is a one to indicate valid data on the last complete byte
on the interface. For odd 12 bit words (assuming zero based
counting), align = 0 indicates bits 0:3 are valid, and bits 4:11
are invalid. Align = 1 for these words indicates that all 12 bits
are valid. For even words, align should normally be a 1.

Short packets are indicated by signaling a stop on byte 53 of the
transfer. In reality, 54 bytes will be transferred, but the packet
is flagged as a short packet.

Final segment is a one to indicate a final segment of a packet and
a zero to indicate a partial segment of a packet. Only one packet
can be in transit at any one time on this bus. This bit is only
valid for packets. For cells this bit should be a one. Packets
which are not final segments should be terminated only on odd
cycles with all bits utilized.



TABLE 13: OC192 portcard to fabric routeword stop/align

Field: Stop/Align
Length: 3 + 4 * number of extra clocks
Function: Due to length restrictions on this bus, the stop/align
has to be treated differently than for OC48 transfers.

The first clock cycle, this field is 3 bits long and is notated as
SAF0. In all future clock cycles the stop field is 4 bits long and
notated SAFn. The definitions of SAF0 and SAF1 are given below.

SAF0(0): Bit zero is a zero to indicate a stop, a one to indicate
no stop.
SAF0(2:1): "00" indicates full word transfer.
           "01" indicates a full word transfer but for a short packet.
           "10" indicates a full word transfer but not the final segment.
           "11" is reserved.

SAF1(0): Bit zero is a zero to indicate a stop, a one to indicate
no stop on the current cycle.

SAF1(3:1): binary value of the number of valid bytes. Zero is
reserved and 7 is used to indicate 6 bytes valid but not the final
segment. 6 indicates 6 bytes valid and final segment. All partial
word transfers automatically indicate an implied final segment.
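
A decoder for the two stop/align fields described in Table 13 might look like the sketch below. The function names and return shapes are hypothetical, and the sketch assumes that in the "2:1" and "3:1" ranges the higher-numbered bit is the more significant one.

```python
SAF0_CODES = {
    0b00: "full word transfer",
    0b01: "full word transfer, short packet",
    0b10: "full word transfer, not the final segment",
    0b11: "reserved",
}


def decode_saf0(field: int):
    """Decode the 3-bit first-cycle field into (stop, meaning)."""
    if not 0 <= field <= 0b111:
        raise ValueError("SAF0 is 3 bits wide")
    stop = (field & 1) == 0              # bit 0: zero means stop
    return stop, SAF0_CODES[(field >> 1) & 0b11]


def decode_saf1(field: int):
    """Decode a 4-bit later-cycle field into (stop, valid_bytes, final_segment)."""
    if not 0 <= field <= 0b1111:
        raise ValueError("SAF1 is 4 bits wide")
    stop = (field & 1) == 0              # bit 0: zero means stop on this cycle
    count = (field >> 1) & 0b111         # bits 3:1: number of valid bytes
    if count == 0:
        raise ValueError("a byte count of zero is reserved")
    if count == 7:
        return stop, 6, False            # 6 valid bytes, not the final segment
    return stop, count, True             # 1..6 valid bytes imply a final segment


if __name__ == "__main__":
    print(decode_saf0(0b011))
    print(decode_saf1(0b1110))
```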



TABLE 14: OC48 Fabric-Port card routeword stop/align

Field: Stop/Align/FS
Length: 3 + 2 * number of extra clocks
Function: Value is treated as a repeated 2 bit value (encoded stop)
followed by the final segment bit. Stop field is interpreted as:

11 - continue
00 - 1st byte finished is valid and stop
01 - 2nd bytes finished is valid and stop
10 - 3rd byte finished is valid and stop, or non-final segment.

Short packets are indicated by flagging a stop at byte 53.
Final segment is a one for a final segment, a zero for a continuing
packet. For final segments, the stop field should be encoded as a
"10".









The port card-fabric interface at OC192 variable routeword bits are
given in the table below.

TABLE 15: OC192 Fabric-port card routeword stop/align



r- : - 1 .1 
I' ICIU 


Length 


Function 


Stop/Align 


7 i 8* numbei 


Bit 0 indicates stop. Zero indicates stop, 1 continue.


transfer 






Values OxC, Oxr are reserved. Any non-12 byte ending offset automatically signals end of segm : 


cycle of data.









Depending on the switch configuration, the bus may not transfer an
integer number of bytes. This is handled by the interface always
flagging the bytes which finish, and the transmit and receive state
machines must track where bytes begin and end based on the current
cycle in the transfer.

The bus consists of a multiplexed address/data bus (AD_DATA), a
select signal (AD_SEL_L), a read/write signal (AD_RW), and a bus
transaction complete indication signal (AD_RDY_L). AD bus is used
for read/write access of control/status registers.

In order to write to a control/status register, the read/write
signal (AD_RW) must be low. The select signal (AD_SEL_L) must be
asserted low for the entire duration of the access, and values must
be placed on the AD_DATA bus in the following sequence (cycle 0 is
the first cycle where AD_SEL_L is low for this transaction):

- cycle 2-5: Data to be written to control/status register. For
registers that are wider than 8 bits (maximum of 32 bits) write
data must be presented one byte per cycle starting with LSB. Any
data presented on the bus beyond the width of the register will be
ignored.

- cycles > 5: ASIC will assert AD_RDY_L on completion of the write
access, and will keep it asserted until AD_SEL_L is de-asserted.



Figure 12 shows a Write Cycle. 



In order to read from a control/status register, the read/write
signal (AD_RW) must be high. The select signal (AD_SEL_L) must be
asserted low for the entire duration of the access, and values must
be placed on the AD_DATA bus in the following sequence (cycle 0 is
the first cycle where AD_SEL_L is low for this transaction):

- cycle 0-1: Address of control/status register.

- cycle 2: AD_DATA bus should be released (hi-z).

- cycles > 2: When the data is available, ASIC will drive the read
data onto the bus, one byte per cycle for four cycles, along with
assertion of AD_RDY_L signal. For registers smaller than 32 bits
wide, unused bits are presented as zeros. The LSB is present on the
bus during the 1st clock cycle of AD_RDY_L assertion.

Figure 13 shows a Read Cycle. 
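
The byte ordering on AD_DATA described above (one byte per cycle, least significant byte first) can be illustrated with a short sketch. The helper names write_data_bytes and read_value are hypothetical; the sketch models only the data bytes, not the address cycles or the AD_SEL_L, AD_RW and AD_RDY_L signal timing.

```python
def write_data_bytes(value: int, width_bits: int) -> list[int]:
    """Bytes to present on AD_DATA for a write, one per cycle, LSB first."""
    if not 1 <= width_bits <= 32:
        raise ValueError("registers are at most 32 bits wide")
    if value < 0 or value >> width_bits:
        raise ValueError("value does not fit in the register")
    n_bytes = (width_bits + 7) // 8
    return list(value.to_bytes(n_bytes, "little"))


def read_value(cycle_bytes: list[int]) -> int:
    """Reassemble the four bytes driven during a read (LSB arrives first)."""
    if len(cycle_bytes) != 4:
        raise ValueError("a read always returns four bytes")
    return int.from_bytes(bytes(cycle_bytes), "little")


if __name__ == "__main__":
    beats = write_data_bytes(0x12345678, 32)
    print([hex(b) for b in beats])
    print(hex(read_value(beats)))
```

For a register narrower than 32 bits, the unused upper bytes returned by a read are zeros, so read_value still yields the register contents.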

The switch chips will generate interrupts on error conditions. The
interrupt lines have the following characteristics:

1. Level Sensitive

2. Active Low

3. Asynchronous (no clock generated to go along with the
interrupt).

Assume point to point interconnection with board logic which
combines together interrupts.

Interrupts are maskable on a condition by condition basis inside
each chip. The interrupt signal is asserted on the occurrence of an
error condition and is cleared when the error condition is cleared.
Any temporary conditions which caused an interrupt are recorded in
the chip so no phantom interrupts should be seen.



The reality of the switch is that errors will occur. The intent in
the following is to detail the expected system behavior and
recovery strategy needed for each error type.

TABLE 16: Error recovery in the ASICs

Error    Detection Mechanism    Error recovery required    Hardware comments



Error: Stuck bit on portcard egress
Detection mechanism: Unstriper sees data corruption from one
fabric.

Error: Stuck bit between agg & memory controller
Detection mechanism: Unstriper sees data corruption from one
fabric, either route word or data.

Error: Stuck bit between memory controller & separator
Detection mechanism: Unstriper sees data corruption from one
fabric, either route word or data.

Error: Stuck bit on fabric egress

Error: Soft-fail on routeword from portcard
Detection mechanism: At least two unstripers see either a routeword
mismatch, a state with a high number of routeword mismatches, or
data parity errors, or any number of unstripers will see a
routeword mismatch, a high number of routeword mismatches or data
parity errors and an aggregator will see a synch error?
Error recovery required: Queue resynch
Hardware comments: Worst case scenario involves failing routeword
with different fabric routewords to fabrics. Either queueing a
packet to the wrong port or dropping the traffic in the aggregator
can cause an impact to all ports. Probability of impacting more
ports goes up with traffic load and memory utilization in memory
controllers.

Error: Soft-fail on data from port-card
Detection mechanism: Unstriper sees one time error.
Error recovery required: None
Hardware comments: Probability of automatic hardware based data
recovery is high.

Error: Soft-fail between agg/memory controller destid bus
Detection mechanism: At least two unstripers see either a routeword
mismatch, a state with a high number of routeword mismatches, or
data parity errors.
Error recovery required: Queue resynch

Error: Soft-fail between agg/memory controller data bus
Detection mechanism: Unstriper sees one time error.
Error recovery required: None
Hardware comments: Probability of automatic hardware based data
recovery is high.

Error: Soft-fail between memory controller/separator channel ID bus
Detection mechanism: At least two unstripers see either a routeword
mismatch, a state with a high number of mismatches, or data parity
errors.
Error recovery required: Queue resynch
Hardware comments: Tokens get out of synch. May see error of FIFO
overflow in the separator, depending on traffic pattern. Need
congestion on the fabric for a port to have the FIFO overflow
become possible. May also see excess tokens in memory controller.

Error: Soft-fail between memory controller/separator data bus for
RW data
Detection mechanism: Packet boundaries from one separator port are
lost. Unstriper will show a large number of errors for all traffic
from the affected aggregator output.
Error recovery required: Queue Resynch
Hardware comments: Inherent that no self-stabilization occurs w/o
queue resynch.

Error: Soft-fail between memory controller/separator data bus for
packet data
Detection mechanism: Single port sees one-time error.

soft-fail on token bus from

differences in separator














Reset

soft-fail internal to fabric chips

Queue Resynch may fix the problem, reset is necessary for




















Aggregator never sets flag indicating it has seen back plane sync

plane idle to synchronise to rw bus

sync

synch

Locating fault requires seeing if only this board is having
problems (backplane sync receiver) or if multiple boards are
reporting problems (lost both sync signals on the back plane).
Error isolation in 40G switch requires looking at the state of the
secondary synch pulse generator


A _ n , t 

1. £•..-..- 


















jvpui UlUI UVVH gVlt} I11111U1 

synch 












unstriper does not see back

synch

Initialize the hardware

Chips do not do anything

Fault can be caused by failure of the on-board processor. If
soft-fail, watchdog should catch it.


Error: Striper not initialized
Detection mechanism: Transmit no data on the back plane
Error recovery required: Initialize striper

Error: Unstriper not initialized
Detection mechanism: All incoming data ignored
Error recovery required: Initialize unstriper

Detection mechanism: Offending data is dropped in striper,
interrupt asserted
Error recovery required: Correct stripe amount
Hardware comments: Detection comes up as a result of a disagreement
between the stripe amount and the configuration register for the
switch operating mode.


















Primary sync pulse TX failure

Synch pulse receiver on all

Replace board with primary TX failure

Synch pulse receiver on all boards will see error on

Replace board with bad synch pulse receiver

Need to see how wide error is

error either in an aggregator or an unstriper fed by this block.















Board loses single sync pulse internal to the board

If any FIFOs overflow in aggregator or unstriper, queue resynch














May see FIFO overflow/underflow in fabric chip or see synch


rcipiaix 


a rubric 


chip. Additionally, if data is corrupted, the unstriper will








same as below. 


Error: Hard failure on sync pulse distribution to a single chip on
a port card
Detection mechanism: unstriper. May see what looks like a single
fabric mismatch due to one fabric going out of synch before the
others.
Error recovery required: Reset port card




XT 






Error: soft failure on sync pulse distribution to a single chip on
Error recovery required: If no FIFO overflow, none. If FIFO
overflow, need to reset board(s) with FIFO overflow.
Hardware comments: Striper missing synch pulse could overflow a
FIFO on every fabric. Recovery would need to be done serially and
switch could be effectively down by this error. Only way to ensure
all fabrics do the same thing is to ensure that data path has the
same delay as the synch path






Reset the fabric

soft failure on sync pulse
on a fabric














Error: soft failure on sync pulse distribution to multiple chips on
a port card
Detection mechanism: Same as single-failure case
Error recovery required: Same as single-failure
Hardware comments: Same as single-failure.











The chipset implements certain functions which are described here.
Most of the functions mentioned here have support in multiple
ASICs, so documenting them on an ASIC by ASIC basis does not give a
clear understanding of the full scope of the functions required.

The switch chipset is architected to work with packets up to 64K +
6 bytes long. On the ingress side of the switch, there are busses
which are shared between multiple ports. Most packets are
transmitted without any break from the start of packet to end of
packet. However, this approach can lead to large delay variations
for delay sensitive traffic. To allow delay sensitive traffic and
long traffic to coexist on the same switch fabric, the concept of
long packets is introduced. Basically, long packets allow chunks of
data to be sent to the queueing location, built up at the queueing
location on a source basis and then added into the queue all at
once when the end of the long packet is transferred. The definition
of a long packet is based on the number of bits on each fabric. The
following table gives the size of long packets for different switch
sizes.
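
The long packet mechanism (chunks built up per source, then added to the queue all at once) can be sketched as follows. This is a minimal illustration, not the ASIC logic: the class name LongPacketAssembler, the use of byte strings for chunks, and the explicit end_of_packet flag are all assumptions made for the example.

```python
from collections import deque


class LongPacketAssembler:
    """Build long packets per source and enqueue each one atomically."""

    def __init__(self):
        self.queue = deque()   # destination queue holding complete packets
        self.pending = {}      # source id -> chunks received so far

    def receive(self, source, chunk: bytes, end_of_packet: bool) -> None:
        # Chunks accumulate off to the side, keyed by their source.
        self.pending.setdefault(source, []).append(chunk)
        if end_of_packet:
            # Only a complete packet enters the queue, in one step.
            self.queue.append(b"".join(self.pending.pop(source)))


if __name__ == "__main__":
    asm = LongPacketAssembler()
    asm.receive(1, b"AA", end_of_packet=False)   # long packet starts
    asm.receive(2, b"x", end_of_packet=True)     # short packet slips in
    asm.receive(1, b"BB", end_of_packet=True)    # long packet completes
    print(list(asm.queue))
```

In this sketch a short, delay sensitive packet from another source is queued ahead of a long packet that started earlier but finished later, which is the coexistence the paragraph above describes.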



TABLE 17: Long packet sizes

Switch Size    Packet Size

rt j. _ -\ 

15 4e 9m 

Of> 1 OftA 

UTT i uw 

i in TTAA 

iTu L T\j\j 

t rr\ t s r\r\ 

Tw JXJXJV 

-\ a r\ c a An 

iLTU JH\J\J 

2 A a on r\s r\r\ 

\J TUU J\J\J\J 



If the switch is running in an environment where Ethernet 
MTU is maintained throughout the network, long packets will not be 
seen in a switch greater than 40G in size. 

A wide cache-line shared memory technique is used to store
cells/packets in the port/priority queues. The shared memory is 8K
entries x 200 bit wide running at 125MHz. Each memory controller
ASIC yields 25Gbps memory bandwidth. The aggregator #9 (control
port) generates at most 4 streams of OC48 traffic. The enqueue and
dequeue speed for different switch configurations is shown in the
following table. Note that a 2x speedup can be achieved for all
switch configurations except the 480G switch. Up to 234,057 cells
can be stored in the 480G switch. The shared memory stores
cells/packets continuously so that there is virtually no
fragmentation and bandwidth waste in the shared memory.



For the short packets/cells, memory utilization can be close to
100%. For the long packets, the memory block before the start of a
long packet can be almost completely wasted. The minimum length for
a long packet is 3 cache lines, giving an effective utilization of
memory close to 75% since 1 out of 4 memory cache lines can be
wasted.



TABLE 18: Shared Memory (1,638,400 bits) in Each Memory Controller

Switches    Enqueue    Dequeue    Speedup Ratio    Cell Length    Number of Cells



i inr 
i r r\t~* 
a r\r* 

1UUU 



4.3Gb p 3 
4.7Gb p s 
.OGb p s 
§.3Gb p s 

-tr*<i 

9.4Gb p s 



20.7Gb p s 
20.3Gb p s 
■ OGb p s 
19.7Gb p s 
1 OGbps 
15.6Gb p s 



fr3 



Mr 



39i 1 b it s 
21 1 1 bits 
15i 1 bits 
1211 bits 
? 1 1 bi t s 
6 i 1 bi t s 



74,472 

102,400 

12 6 ,030 

163,840 

234,057 



There exist multiple queues in the shared memory. They are
per-destination and priority based. All cells/packets which have
the same output priority and blade/channel ID are stored in the
same queue. Cells are always dequeued from the head of the list and
enqueued into the tail of the queue. Each cell/packet consists of a
portion of the egress route word, a packet length, and
variable-length packet data. Cells and packets are stored
continuously, i.e., the memory controller itself does not recognize
the boundaries of cells/packets for the unicast connections. The
packet length is stored for MC packets. There is a limitation of 4K
packets (or cells) in each of the MC queues.

The multicast port mask memory 64Kx16-bit is used to store the
destination port mask for the multicast connections, one entry (or
multiple entries) per multicast VC. The port masks of the head
multicast connections indicated by the multicast DestID FIFOs are
stored internally for the scheduling reference. The port mask
memory is retrieved when the port mask of head connection is
cleaned and a new head connection is provided.

Two configurations of port mask memory are supported:

1. 8K port connections, for a 240G switch

2. 4K connections, for a 480G switch.

Dequeue performance is restricted by several factors: 1) Padding
injected by the aggregator ASICs; 2) Left alignment entries
inserted in the memory controllers; 3) Memory controller output bus
fragmentation caused by the multicast connections; 4) Token bus
latency between the separators and the memory controllers; 5)
Separator output bus padding; and 6) Unstriper output bus
fragmentation. A 480G switch is used as an example to analyze the
worst-case performance since it has most padding, overhead, and
congested traffic.

The aggregator ASICs have to pad a packet (including 36 bit route
word, variable length packet length field and datagram) to
multiples of 12 since there are 12 memory controllers in one
fabric. The shortest packet each memory controller received is 7
bit long since a packet can be as short as 84 bit long. The
effective datagram is 3 bits. One entry will be left aligned for
every 16 200-bit memory entries. The left aligned entry can be as
short as 1 bit long. The worst-case datagram dequeue efficiency per
output port of a memory controller is:

(10 bit (dout_me bus width) * (3/7) (datagram length in a shortest
packet) * (15/16) (left-aligned overhead)) * 250MHz (output bus
speed) * 12 (number of memory controllers) / (number of output
ports per separator) = 502Mbps

The best-case output data bus bandwidth per separator channel is 2
bit * 250MHz, i.e., 500Mbps. In other words, the worst case dequeue
bandwidth of a memory controller is bigger than the best-case
output bandwidth of a separator port. 2x speedup can be achieved
through the twice wide output bus of the separators. One sync cycle
will be fired on the output bus of the separator every 120 cycles.
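
The comparison between the worst-case dequeue bandwidth and the best-case separator channel bandwidth can be checked numerically. The sketch below is illustrative only. The number of output ports per separator is not legible in the text above; the value 24 used here is a hypothetical figure chosen because it reproduces roughly 502 Mbps, and the function names are likewise assumptions.

```python
def worst_case_dequeue_mbps(ports_per_separator: int,
                            bus_width_bits: int = 10,
                            datagram_fraction: float = 3 / 7,
                            left_align_fraction: float = 15 / 16,
                            bus_mhz: int = 250,
                            memory_controllers: int = 12) -> float:
    """Worst-case datagram dequeue bandwidth per output port, in Mbps."""
    useful_bits = bus_width_bits * datagram_fraction * left_align_fraction
    return useful_bits * bus_mhz * memory_controllers / ports_per_separator


def best_case_separator_mbps(bus_width_bits: int = 2, bus_mhz: int = 250) -> int:
    """Best-case output bandwidth of one separator channel, in Mbps."""
    return bus_width_bits * bus_mhz


if __name__ == "__main__":
    ASSUMED_PORTS_PER_SEPARATOR = 24   # hypothetical, not stated in the source
    worst = worst_case_dequeue_mbps(ASSUMED_PORTS_PER_SEPARATOR)
    best = best_case_separator_mbps()
    print(f"worst-case dequeue: {worst:.1f} Mbps, separator channel: {best} Mbps")
```

Under that assumption the memory controller side stays slightly ahead of the 500 Mbps separator channel, consistent with the conclusion stated above.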

The output bus of the unstriper ASIC is 64 bit wide at 100MHz. It
can only carry one packet per cycle. In the worst case, up to 50
bits are wasted per packet for an OC48 port.

APS stands for Automatic Protection Switching, which is a SONET
redundancy standard. To support the APS feature in the switch, two
output ports on two different port cards send roughly the same
traffic. The memory controllers maintain one set of queues for an
APS port and send duplicate data to both output ports.

To support data duplication in the memory controller ASIC, each one
of [[192]] multiple unicast queues has a programmable APS bit. If
the APS bit is set to one, a packet is dequeued to both output
ports. If the APS bit is set to zero for a port, the unicast queue
operates at the normal mode. If a port is configured as an APS
slave, then it will read from the queues of the APS master port.
For OC48 ports, the APS port is always on the same OC48 port on the
adjacent port card.
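
The effect of the APS bit on dequeue can be sketched as below. The function names are hypothetical, and the pairing of "adjacent" port cards as (0,1), (2,3), and so on is an assumption made only so the example is concrete; the text does not say how adjacency is numbered.

```python
def aps_partner(card: int, port: int) -> tuple[int, int]:
    """Same OC48 port index on the adjacent card (assumed pairing: card XOR 1)."""
    return (card ^ 1, port)


def dequeue_targets(card: int, port: int, aps_bit: bool) -> list[tuple[int, int]]:
    """Output ports that receive a packet dequeued from this unicast queue."""
    targets = [(card, port)]
    if aps_bit:
        # APS bit set: the same packet is also sent to the protection port.
        targets.append(aps_partner(card, port))
    return targets


if __name__ == "__main__":
    print(dequeue_targets(2, 3, aps_bit=False))
    print(dequeue_targets(2, 3, aps_bit=True))
```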

Port mirroring is similar to the APS except that any port can pair
with any port. Only one pair of port mirroring ports is supported.
A 16-bit port mirror register is used to identify the master and
slave port involved in the port mirror operation. All ports are
compared to the master portion (bit 15:8) of the register when
dequeuing. Port mirror can be disabled. Note that a port can either
have APS enabled or port mirroring enabled, not both. The value of
the port mirror register can be changed on the fly by the shadow
registers.

The shared memory queues in the memory controllers among the
fabrics might be out of sync (i.e., same queues among different
memory controller ASICs have different depths) due to clock drifts
or a newly inserted fabric. It is important to bring the fabric
queues to the valid and sync states from any arbitrary states. It
is also desirable not to drop cells for any recovery mechanism.

A resync cell is broadcast to all fabrics (new and existing) to
enter the resync state. Fabrics will attempt to drain all of the
traffic received before the resynch cell before queue resynch ends,
but no traffic received after the resynch cell is drained until
queue resynch ends. A queue resynch ends when one of two events
happens:

1. A timer expires.

2. The amount of new traffic (traffic received after the resynch
cell) exceeds a threshold.

At the end of queue resynch, all memory controllers will flush any
left-over old traffic (traffic received before the queue resynch
cell). The freeing operation is fast enough to guarantee that all
memory controllers can fill all of memory no matter when the
resynch state was entered.
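
The two end conditions and the final flush can be sketched as a small state object. This is an illustration under stated assumptions: the class name ResynchState, the use of cycle counts for the timer, item counts for the new-traffic threshold, and draining one old item per cycle are all hypothetical simplifications, not details taken from the ASICs.

```python
class ResynchState:
    """Track one queue resynch from the resynch cell until it ends."""

    def __init__(self, old_traffic, timeout_cycles: int, threshold: int):
        self.old = list(old_traffic)   # traffic received before the resynch cell
        self.new = []                  # traffic received after it (held back)
        self.timeout = timeout_cycles
        self.threshold = threshold
        self.elapsed = 0

    def tick(self, arrivals=()):
        """Advance one cycle; return (drained_item, resynch_ended, flushed_items)."""
        self.elapsed += 1
        self.new.extend(arrivals)                      # new traffic is only buffered
        drained = self.old.pop(0) if self.old else None
        ended = (self.elapsed >= self.timeout          # 1. timer expires
                 or len(self.new) > self.threshold)    # 2. new traffic exceeds threshold
        flushed = []
        if ended:
            flushed, self.old = self.old, []           # flush left-over old traffic
        return drained, ended, flushed


if __name__ == "__main__":
    state = ResynchState(["a", "b", "c"], timeout_cycles=2, threshold=10)
    print(state.tick())
    print(state.tick())
```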

Queue resynch impacts all 3 fabric ASICs. The aggregators must
ensure that the FIFOs drain identically after a queue resynch cell.
The memory controllers implement the queueing and dropping. The
separators need to handle memory controllers dropping traffic and
resetting the length parsing state machines when this happens. For
details on support of queue resynch in individual ASICs, refer to
the chip ADSs.

Multicast connections are enqueued into one of 4 priority queues
based on the 2-bit priority number. They are stored cache line
based, the same way unicast connections are. Connection numbers and
lengths are stored into one of 4 1K-entry per-priority connection
FIFOs. Multicast packets are subject to be dropped if the destined
connection FIFO is full. In other words, at most 1K multicast
packets can be stored simultaneously for each priority.

The 64Kx16-bit port mask memory will limit the number of multicast
connections supported to 64K, 32K, 16K, 16K, 8K, and 4K for the
40G, 80G, 120G, 160G, 240G, and 480G switch, respectively.
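
One way to see where the per-configuration connection limits could come from is sketched below. This is an inferred model, not something stated in the text: it assumes one mask bit per OC48 blade port (2.5G each), 16-bit mask words, and a mask footprint rounded up to a power-of-two number of words. Under those assumptions the arithmetic reproduces the 64K, 32K, 16K, 16K, 8K and 4K figures.

```python
PORT_MASK_WORDS = 64 * 1024   # 64K entries in the port mask memory
WORD_BITS = 16                # each entry is 16 bits wide


def blade_ports(switch_gbps: int) -> int:
    """Number of OC48 (2.5G) blade ports for a given switch size (assumed)."""
    return switch_gbps * 2 // 5


def words_per_mask(ports: int) -> int:
    """16-bit words reserved per port mask, rounded up to a power of two (assumed)."""
    needed = -(-ports // WORD_BITS)   # ceiling division
    words = 1
    while words < needed:
        words *= 2
    return words


def max_multicast_connections(switch_gbps: int) -> int:
    return PORT_MASK_WORDS // words_per_mask(blade_ports(switch_gbps))


if __name__ == "__main__":
    for size in (40, 80, 120, 160, 240, 480):
        print(f"{size}G: {max_multicast_connections(size) // 1024}K connections")
```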

For the dequeue side, multicast connections have independent 32
tokens per port, each worth up to 50-bit data or a complete packet.
The head connection and its port mask of a higher priority queue is
read out from the connection FIFO and the port mask memory every
cycle (125MHz). A complete packet (or 50 bits if the packet is
longer than 50 bits) is isolated from the 200-bit multicast cache
line based on the length field of the head connection. The head
packet is sent to all its destination ports. The 8 queue drainers
transmit the packet to the separators when non-zero multicast
tokens are available for the ports. The next head connection will
be processed only when the current head packet is sent out to all
its ports.

For the worst case analysis, use the 480G switch as an example
where the shortest packet is 7 bit long. Every 8ns cycle only one
connection can be handled (bottlenecked by the connection FIFO and
port mask memory). If the multicast only goes to 1 port, the
effective dequeue throughput for the multicast connection is
875Mbps out of available 15Gbps shared memory dequeue bandwidth,
i.e., 6%. In other words, the multicast performance is severely
damaged by the bottlenecks existing in the connection FIFO, port
mask memory, and head-of-line blocking. The throughput for the 480G
switch is 480 * 7 * n / 80 = n * 42G where n is the number of
copies a multicast connection is destined to make. In the worst
case where n=1, the multicast throughput is about 9% of available
switch capacity. If the average multicast connection makes 11
copies, the switch can achieve 480G throughput.
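
The worst-case multicast figures can be reproduced with a few lines of arithmetic. The sketch below reads the garbled throughput expression in the text as 480 * 7 * n / 80, which is an interpretation chosen because it matches the stated 9% for one copy and near-full capacity at 11 copies; the function names are hypothetical.

```python
def single_copy_mbps(packet_bits: int = 7, cycle_ns: int = 8) -> float:
    """One shortest packet handled per cycle: bits per ns converted to Mbps."""
    return packet_bits / cycle_ns * 1000


def multicast_gbps(copies: int, switch_gbps: int = 480, packet_bits: int = 7) -> float:
    """Aggregate multicast throughput under the assumed reading 480 * 7 * n / 80."""
    return switch_gbps * packet_bits * copies / 80


if __name__ == "__main__":
    rate = single_copy_mbps()
    print(f"per-connection dequeue: {rate:.0f} Mbps "
          f"({rate / 15000:.1%} of 15 Gbps)")
    for n in (1, 11):
        total = multicast_gbps(n)
        print(f"n={n}: {total:.0f}G ({total / 480:.1%} of 480G)")
```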

The longer a packet is (for the 240G switch or smaller
configurations) and the more ports a multicast connection is
destined to, the better the dequeue performance becomes. Multicast
performance does not affect the dequeue speedup of unicast
connections since the latter have their own tokens and the two
types of connections share the dout_me bus alternately in a strict
round-robin fashion, i.e., the multicast connections do not block
unicast ones.

There are 192 unicast queues, 4 multicast queues, and 4 control
port queues. The 4 multicast queues are per priority based and can
broadcast to any subset of 192 output ports and the 4 control
ports.

There are up to 196 destination channels (192 blade channels and 4
control ports) for the 480G switch. Each destination has a one to
one mapped unicast queue. 4 multicast queues can broadcast to any
subsets of 192 regular ports indicated by the per connection based
port mask entry. An OC192 port uses one out of 4 queue locations.
Other three queues are unused. All 8-bit fabric queue ID field on
the DestID bus is used to identify one of 196 ports. 2-bit priority
field is unused.

For the 240G switch, up to 100 destination channels exist (96 blade
channels and 4 control ports). 96 unicast destination queues have 2
priority queues each. 4 multicast queues can broadcast to any
subsets of 96 ports indicated by the per connection based port mask
entry. An OC192 port uses one out of 4 queue locations. Other three
queues are unused. Lower 7 bit queue ID is used to identify one of
100 ports and lower 1 bit of priority field is used to identify one
of two priority queues in each port. Other queue ID bit and
priority bit is unused.

For the 160G switch, up to 68 destination channels exist (64 blade
channels and 4 control ports). 64 unicast destination queues have 2
priority queues each. There are 36 unused queues. 4 multicast
queues can broadcast to any subsets of 64 ports indicated by the
per connection based port mask entry. An OC-192 port uses one out
of 4 queue locations. Other three queues are unused. Lower 7 bit
queue ID is used to identify one of 100 ports and lower 1-bit of
priority field is used to identify one of two priority queues in
each port. Other queue ID bit and priority bit is unused.

For the 120G or smaller switch, up to 52 destination channels exist
(48 blade channels and 4 control ports). 48 unicast destination
queues have 4 priority queues each. 4 multicast queues can
broadcast to any subsets of 48 ports indicated by the per
connection based port mask entry. An OC-192 port uses one out of 4
queue locations. Other three queues are unused. Lower 6-bit queue
ID is used to identify one of 52 ports and 2 bit priority field is
used to identify one of 4 priority queues in each port. Other queue
ID bits are unused.
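
The way the queue ID and priority fields are interpreted in each configuration can be summarized in a small lookup. The sketch assumes an 8-bit queue ID field and a 2-bit priority field on the DestID bus, with the used bits taken from the low end as described above; the table and function names are hypothetical.

```python
# (queue ID bits used, priority bits used) per switch configuration
FIELD_LAYOUT = {
    "480G": (8, 0),   # one queue per destination, priority field unused
    "240G": (7, 1),   # two priority queues per port
    "160G": (7, 1),
    "120G": (6, 2),   # 120G or smaller: four priority queues per port
}


def decode_queue(switch: str, queue_id: int, priority: int) -> tuple[int, int]:
    """Return (port, priority_queue) using only the bits that configuration uses."""
    id_bits, prio_bits = FIELD_LAYOUT[switch]
    port = queue_id & ((1 << id_bits) - 1)           # keep the lower id_bits
    prio_queue = priority & ((1 << prio_bits) - 1)   # keep the lower prio_bits
    return port, prio_queue


if __name__ == "__main__":
    print(decode_queue("480G", 0xC3, 3))
    print(decode_queue("240G", 0xE3, 3))
    print(decode_queue("120G", 0xFF, 2))
```

In each case the product of ports and priority queues stays within the 192 unicast queues (192 x 1, 96 x 2, 48 x 4).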

Queue structure can be changed on the fly through the fabric resync
cell, where the number of priority per port field is used to
indicate how many priority queues each port has.

The striper ASIC resides on the network blade. It has the following
features:

- Supports packet/cell interfaces. Can accept up to 3 Gb/sec of
sustained traffic (3.2 Gb/sec in bursts) of cells, frames, or a mix
of cell and frame traffic.

- Generates fabric routeword for all fabrics in the switch.

- Calculates data for the parity fabric and adds checksum to the
end of each packet.

- Supports switch configurations: 40G, 80G, 120G, 160G, 240G, and
480G.

- Generates appropriate signals to interface directly to the
transmit side of the Gbit transceivers.

The Striper takes DID cell/packet format from the ingress port
ASIC. For the ATM interface, the ASX cell format is accepted from
the Vortex ASIC of the Foseidon chipset at 2.5Gbps for the
channelized blade. It consists of 4 byte route word, 4 byte ATM
cell header (without HEC byte), and 48 byte payload. The 36 bit
switch route word can be generated based on the ASM route word
provided by the Vortex ASIC.

The Striper ASIC consists of three major blocks: the
switch route word generator, the switch payload & checksum
generator, and the switch parity generator.

The switch payload generator forwards the 4 byte ATM cell
header, 48 byte ATM cell payload and 2-byte checksum to up to 12
switch fabrics and 1 spare fabric. The cell bus is 2x 12 bit wide
running at 125MHz.

The Striper ASIC duplicates the packet/cell and transmits
various fragments to the fabrics. The 12 data output buses of the
striper ASICs are connected to the data input buses of the
aggregator ASICs on the fabrics as follows:

Figure 14 shows the striper ASIC architecture.

TABLE 19: Data bus connectivity of the Striper ASIC of blade #1
The striper ASICs on blade #1 are connected with
aggregator ASIC #1 of all switch fabrics. The striper ASICs on
blade #2 are connected with aggregator ASIC #2 of all switch
fabrics. The striper ASICs on blade #4 are connected with aggregator
ASIC #4 of all switch fabrics. The striper ASICs on blade #5 to #8
are connected with aggregator ASIC #5 to #8 of all switch fabrics,
respectively. The striper ASICs on blade #41 to #48 are connected
with aggregator ASIC #1 to #8 of all switch fabrics, respectively.
In other words, the blade number modulo 8 is the aggregator ASIC
number which a striper ASIC is connected to.
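The blade-to-aggregator rule above can be sketched in a few lines. This is only an illustration under stated assumptions: blades and aggregator ASICs are numbered from 1, there are 8 regular aggregator ASICs per fabric and up to 48 blades, and a blade number that is an exact multiple of 8 maps to aggregator #8. The function names are hypothetical.

```python
NUM_AGGREGATORS = 8  # assumed count of regular aggregator ASICs per fabric


def aggregator_for_blade(blade: int) -> int:
    """Map a 1-based blade number to a 1-based aggregator ASIC number."""
    if blade < 1:
        raise ValueError("blade numbers start at 1")
    # "blade number modulo 8", shifted so that multiples of 8 map to #8
    return (blade - 1) % NUM_AGGREGATORS + 1


def blades_for_aggregator(aggregator: int, num_blades: int = 48) -> list:
    """List every blade that lands on the given aggregator ASIC."""
    return [b for b in range(1, num_blades + 1)
            if aggregator_for_blade(b) == aggregator]
```

Under these assumptions each aggregator serves 6 blades, which matches the 6 blades by 4 channels quoted for a single aggregator ASIC.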

The parity bits are sent to the spare fabric. The purpose
of the spare fabric is to provide fault tolerance ability to the
switch, i.e., in case one of the switch fabrics failed, the spare
fabric recovers the lost part of the cell. This is achieved through
a parity bit generator on the striper ASIC. For the one-fabric
configuration, the 12 bit cell payload is duplicated to the spare
fabric; for the 2-fabric configuration, 6 parity bits are generated
as follows:

parity bit(1:6) = cell bit(1:6) exclusive OR cell bit(7:12);

For the 3-fabric configuration, 4 parity bits are
generated as follows:

parity bit(1:4) = cell bit(1:4) exclusive OR cell bit(5:8)
exclusive OR cell bit(9:12);
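The parity rule above can be illustrated with a short sketch. It assumes a 12-bit cell word split evenly across the data fabrics, with the parity slice being the bitwise exclusive OR of all data slices; the helper names are hypothetical and this is not the ASIC logic itself.

```python
CELL_WIDTH = 12  # bits in one cell word, per the text


def split_slices(cell_bits, num_fabrics):
    """Split a 12-bit word (list of 0/1, index 0 = cell bit 1) into equal slices."""
    if len(cell_bits) != CELL_WIDTH or CELL_WIDTH % num_fabrics != 0:
        raise ValueError("need 12 bits and a fabric count that divides 12")
    width = CELL_WIDTH // num_fabrics
    return [cell_bits[i * width:(i + 1) * width] for i in range(num_fabrics)]


def parity_bits(cell_bits, num_fabrics):
    """XOR all data slices together to form the spare-fabric slice."""
    slices = split_slices(cell_bits, num_fabrics)
    parity = [0] * len(slices[0])
    for data_slice in slices:
        parity = [p ^ b for p, b in zip(parity, data_slice)]
    return parity


def recover_slice(slices, parity, lost):
    """Rebuild the slice of a failed fabric from the survivors and the parity."""
    rebuilt = list(parity)
    for index, data_slice in enumerate(slices):
        if index != lost:
            rebuilt = [r ^ b for r, b in zip(rebuilt, data_slice)]
    return rebuilt
```

With one fabric the parity slice equals the whole word, which corresponds to the payload being duplicated to the spare fabric.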

The route word generator regenerates the switch route
word and sends up to 12+1 1-bit 250MHz route word buses for fabric
1, 2, 3, ..., 12 and the spare fabric.

The aggregator ASIC resides on the switch fabric as shown
in the following figure. Each 40G switch fabric has 8+1 aggregator
ASICs. It aggregates 6x4 separate cell streams and route words into
a single 12G stream from up to 6 blades and 4 channels. All input
signals from the network blades are 250MHz point to point HSTL. It
outputs a single cell stream that is multiplexed with cell payload
and route words to 12 memory controllers. The ASIC has the following
features:

• 12Gbps data and route word input from up to 6 network blades
and 4 channels
• Route word separation and aggregation
• Output 12G data and route word to 12 memory controller ASICs
• HSTL interface with the memory controller, receiver interface
for the backplane gigabit transceivers.

Figure 15 shows the aggregator ASIC architecture.

The aggregator ASIC supports 40G, 80G, 120G, 160G, 240G,
and 480G switch configurations without backplane change. The
backplane connectivity (DIN_AG buses) of a pair of aggregator ASICs
is shown as follows:

TABLE 20: DIN_AG bus connectivity of aggregator ASIC #1 and #5 of
switch fabric #1
The 2 x 6 DIN_AG buses of the aggregator ASIC #1 and #5 pair
of switch fabric #1 are connected to the 12 x DOUT_GT bus #1 of
blade #1, 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, and 45,
respectively. The 2 x 6 DIN_AG buses of the aggregator ASIC #2 and #6
pair of switch fabric #1 are connected to the 12 x DOUT_GT bus #1 of
blade #2, 6, 10, 14, 18, 22, 26, 30, 34, 38, 42, and 46,
respectively. The 2 x 6 DIN_AG buses of the aggregator ASIC #3 and #7
pair of switch fabric #1 are connected to the 12 x DOUT_GT bus #1 of
blade #3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, and 47,
respectively. The 2 x 6 DIN_AG buses of the aggregator ASIC #4 and #8
pair of switch fabric #1 are connected to the 12 x DOUT_GT bus #1 of
blade #4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, and 48,
respectively.

Likewise, the 2 x 6 DIN_AG buses of the aggregator ASIC #1
and #5 pair of switch fabric #2 are connected to the 12 x DOUT_GT
bus #2 of blade #1, 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, and 45,
respectively. The 2 x 6 DIN_AG buses of the aggregator ASIC #1 and #5
pair of switch fabric #12 are connected to the 12 x DOUT_GT bus #12
of blade #1, 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, and 45,
respectively, for the 480G switch configuration.

The above connectivity is repeated 4 times for the
channelized blades.

For the 40G, 80G, 120G, 160G, 240G, and 480G
configuration, each blade channel sends 12 x 36 bit cell payload
and 36 bit route word, 6 x 36 bit payload and 36 bit route word,
4 x 36 bit payload and 36 bit route word, 3 x 36 bit payload and
36 bit route word, 2 x 36 bit payload and 36 bit route word, and 1
x 36 bit payload and 36 bit route word to each switch fabric,
respectively. In other words, the whole 12-bit wide cell is
transmitted in the same fabric for the 40G switch while only a 1
bit wide (1/12 cell) cell slice is transmitted on each fabric for
the 480G switch.
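The relationship between switch size and the slice each fabric carries can be checked with a small calculation. The sketch below assumes 40G of capacity per fabric and that only fabric counts dividing the 12-bit cell width are valid; both constants and the function names are illustrative assumptions rather than values taken from a register map.

```python
CELL_BUS_WIDTH = 12                    # bit columns produced per blade channel
VALID_FABRIC_COUNTS = (1, 2, 3, 4, 6, 12)
ASSUMED_GBPS_PER_FABRIC = 40           # assumption: one fabric carries 40G


def columns_per_fabric(num_fabrics: int) -> int:
    """Number of 36-bit payload columns each fabric receives per cell."""
    if num_fabrics not in VALID_FABRIC_COUNTS:
        raise ValueError("fabric count must divide the 12-bit cell width")
    return CELL_BUS_WIDTH // num_fabrics


def switch_capacity_gbps(num_fabrics: int) -> int:
    """Aggregate capacity for a given number of data fabrics."""
    columns_per_fabric(num_fabrics)  # reuse the validity check
    return num_fabrics * ASSUMED_GBPS_PER_FABRIC
```

For example, `columns_per_fabric(12)` gives a single column, i.e. the 1/12 cell slice carried by each fabric in the largest configuration.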

The 60 bit DOUT_AG bus is split onto 12 memory controller
ASICs, each receiving 5-bit data and a 1 bit clock signal from one
aggregator ASIC. The 15-bit DestID bus is broadcast to all 12
memory controllers. Due to the fan out load concern, 3 copies of
the signals are maintained, each driving 4 ASIC loads.

Every channel of the aggregator sends up to a 12x3x200 bit
cell/packet stream to the 12 memory controllers based on a work
conserving round robin dequeue algorithm, i.e., the next source takes
over if the current source runs out of eligible cells/packets to
send. A strict round robin algorithm is used among the 24 sources. For
the 40G switch, only 4 source channels exist. A source is eligible
to send a cell/packet whenever a full cell or a full short packet
or a 12x3x200 bit segment of a long packet is received.
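A work conserving round robin of the kind described above can be sketched as follows. This is a software illustration only: the class and method names are hypothetical, each source is modelled as a simple FIFO of already-eligible units (full cells, short packets, or long-packet segments), and timing is ignored.

```python
from collections import deque


class RoundRobinDequeuer:
    """Work conserving round robin over a fixed set of sources."""

    def __init__(self, num_sources: int = 24):
        self.queues = [deque() for _ in range(num_sources)]
        self.next_source = 0  # source that gets the first look next time

    def enqueue(self, source: int, unit) -> None:
        """Record that `source` has one more eligible unit to send."""
        self.queues[source].append(unit)

    def dequeue(self):
        """Return (source, unit) from the next non-empty source, or None."""
        count = len(self.queues)
        for offset in range(count):
            source = (self.next_source + offset) % count
            if self.queues[source]:
                # Empty sources are skipped, so no slot is wasted on them.
                self.next_source = (source + 1) % count
                return source, self.queues[source].popleft()
        return None
```

The scan always resumes one position past the source that was last served, so a source with nothing eligible simply yields its turn to the next one.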
Each memory controller ASIC receives 9 independent cell
streams from 9 aggregator ASICs. There are 9 250MHz DIN_ME_fb_se
buses, each consisting of a 5 bit data bus, a 1 bit clock signal,
and a 15-bit DestID bus. The 60 bit DOUT_AG data buses of all 9
aggregator ASICs are bit sliced onto 12 memory controllers, each
receiving 5 bit data from one DOUT_AG bus. Every memory controller
gets a separate non-sharing clock signal (named clk1 to clk12) from
each DOUT_AG bus to reduce the load of the clock pin while 3 memory
controllers share a set of DestID bus from the DOUT_AG bus. The 9
DIN_ME_fb_se buses of memory controller #1 are connected to the
DOUT_AG buses of 9 aggregators as follows:

DIN_ME_fb_1_1_data = DOUT_AG_fb_1_data[48, 36, 24, 12, 0]
DIN_ME_fb_1_1_dest = DOUT_AG_fb_1_dest1
DIN_ME_fb_1_1_clk = DOUT_AG_fb_1_clk1
DIN_ME_fb_1_2_data = DOUT_AG_fb_2_data[48, 36, 24, 12, 0]
DIN_ME_fb_1_2_dest = DOUT_AG_fb_2_dest1
DIN_ME_fb_1_2_clk = DOUT_AG_fb_2_clk1
DIN_ME_fb_1_3_data = DOUT_AG_fb_3_data[48, 36, 24, 12, 0]
DIN_ME_fb_1_3_dest = DOUT_AG_fb_3_dest1
DIN_ME_fb_1_3_clk = DOUT_AG_fb_3_clk1
DIN_ME_fb_1_4_data = DOUT_AG_fb_4_data[48, 36, 24, 12, 0]
DIN_ME_fb_1_4_dest = DOUT_AG_fb_4_dest1
DIN_ME_fb_1_4_clk = DOUT_AG_fb_4_clk1
DIN_ME_fb_1_5_data = DOUT_AG_fb_5_data[48, 36, 24, 12, 0]
DIN_ME_fb_1_5_dest = DOUT_AG_fb_5_dest1
DIN_ME_fb_1_5_clk = DOUT_AG_fb_5_clk1
DIN_ME_fb_1_6_data = DOUT_AG_fb_6_data[48, 36, 24, 12, 0]
DIN_ME_fb_1_6_dest = DOUT_AG_fb_6_dest1
DIN_ME_fb_1_6_clk = DOUT_AG_fb_6_clk1
DIN_ME_fb_1_7_data = DOUT_AG_fb_7_data[48, 36, 24, 12, 0]
DIN_ME_fb_1_7_dest = DOUT_AG_fb_7_dest1
DIN_ME_fb_1_7_clk = DOUT_AG_fb_7_clk1
DIN_ME_fb_1_8_data = DOUT_AG_fb_8_data[48, 36, 24, 12, 0]
DIN_ME_fb_1_8_dest = DOUT_AG_fb_8_dest1
DIN_ME_fb_1_8_clk = DOUT_AG_fb_8_clk1
DIN_ME_fb_1_9_data = DOUT_AG_fb_9_data[48, 36, 24, 12, 0]
DIN_ME_fb_1_9_dest = DOUT_AG_fb_9_dest1
DIN_ME_fb_1_9_clk = DOUT_AG_fb_9_clk1

The DIN_ME data buses of memory controller #2 are
connected to bit 49, 37, 25, 13, and 1 of the DOUT_AG data buses of 9
aggregators, and so on. The DIN_ME data buses of memory controller
#12 are connected to bit 59, 47, 35, 23, and 11 of the DOUT_AG data
buses of 9 aggregators.
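The bit-slice pattern above follows a simple rule: memory controller #m takes every twelfth bit of the 60-bit bus, starting at bit m-1. The sketch below states that rule explicitly; the function name is hypothetical, and the rule for controllers #3 through #11 is an extrapolation of the "and so on" in the text.

```python
NUM_MEMORY_CONTROLLERS = 12
BITS_PER_CONTROLLER = 5


def dout_ag_bits(controller: int) -> list:
    """Bits of the 60-bit DOUT_AG data bus wired to a 1-based memory controller."""
    if not 1 <= controller <= NUM_MEMORY_CONTROLLERS:
        raise ValueError("memory controllers are numbered 1 to 12")
    # Highest bit first, matching the order used in the connection list.
    return [(controller - 1) + NUM_MEMORY_CONTROLLERS * k
            for k in range(BITS_PER_CONTROLLER - 1, -1, -1)]
```

Taken over all 12 controllers, the slices cover each of the 60 bus bits exactly once.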

The 12 memory controller ASICs aggregate cell/packet streams
from the 8+1 aggregator ASICs, then write the cells into one of 200
output queues (e.g., 12 network blades x 4 channelized Poseidon
interfaces x 4 priorities for unicast + 4 priorities for multicast
+ 4 control port queues). The 8 bit destination queue number on
the DestID bus is used as the output queue indicator for the
unicast connection. The multicast cell is stored into one of 4
priority queues based on the 2 bit priority on the DestID bus. The
16 bit multicast connection number on the DestID bus will be used
to look up the internal port mask memory to find out the destination
blade and channels during the dequeue phase.

The memory controllers send out cell/packet traffic from
the output queues to the 8+1 separator ASICs. The dequeuing speed is
twice as fast as the enqueuing speed to reduce the amount of cells
buffered on the switch fabric.

• Support both variable length packet switching and fixed-length
cell switching
• 12 ASICs are bit-sliced and function as an integrated shared
memory controller
• Support 40G, 80G, 120G, 160G, 240G, and 480G switch
configurations
• Enqueue cells/packets from 9 aggregator ASICs
• 2x dequeue speedup to 9 separator ASICs
• On-chip APS support
• 234,057 cells on chip buffer
• 200 programmable destination queues
• On-chip control port support
• 64K multicast connections, 2^32 unicast connections
• Per-queue transmit and loss counts

Figure 16 shows the memory controller ASIC architecture.

An 8Kx13 bit link list is used to maintain the free/used
memory entry list pointer. A free entry is requested from the free
link list when writing data into the shared memory and the current
tail cache line runs out of space. The complete cell/packet will be
dropped whenever the free list is empty, i.e., the shared memory is
full. A memory entry is freed to the free list after the memory
word is transmitted to the separator ASICs.

Figure 17 shows the wide cache line shared memory
architecture.

The DIN_ME_fb_se_9 and DOUT_ME_fb_se_9 buses are used to
connect to aggregator #9 and separator #9, which communicate with
the control port striper and unstriper ASICs only. They have the same
DestID and cell format as the other 8 buses do. Their cells are
enqueued and dequeued in the same way as the regular cells.

There are up to 4 additional control port queues. They
have queue IDs from 192 to 195. All unicast connections having a
control port queue ID as their fabric queue ID are enqueued into the
relative control port queue. There are at most 4 OC-12 control
ports supported.

Each control port queue has a 13 bit control port
register as follows:



TABLE 21: 13 bit Control port queue register

Bit positions: Bit 12:5, Bit 4, Bit 3, Bit 2, Bit 1, Bit 0

Fields: Control Port 3 enable, Control Port 2 enable, Control Port 1
enable, Control Port 0 enable, 8-bit regular port ID, Regular Port
enable


A queue can be multicast to up to 4 physical control
ports and one regular queue. When a queue is redirected to the
regular queue, that queue must be disabled for the regular queue
traffic. Packets are queued in the same way as in the regular queues,
i.e., 200-bit cache line based, left aligned every 16 cache
lines. Strict round robin is used among the 4 queues when a left
alignment entry is transmitted. A queue is routed to 4 control ports
and one regular port based on the 5 bit control port enable vector.

Two dequeue algorithms are applied among the 4 control port
queues:

a) One control port only talks to one cp queue: Pure round
robin dequeue among 4 non-empty control port queues which
have non-zero unicast tokens; one token worth unicast (up
to 200 1 bit) is sent out to the dout_me bus for a port;

b) One control port talks to multicast cp queues: Strict
priority among 4 control port queues; queue 192 has the
highest priority and queue 195 has the lowest; switch queues
when the end of the packet is seen.

OAM cells are identified by the Fabric queue ID field. If
this field of a unicast connection has value 0xFx(h), then it is an
OAM cell. All OAM cells can be mapped into one of the 192 blade or
4 control port queues set by an 8-bit programmable register (called
the OAM cell destination register).

Resync cells (0xFF) or any other special cells with fabric
queue ID set to 0xFx are routed to any one of 196 queues based on
the OAM cell destination register too.
Per destination minimum and maximum thresholds and counts
can be set up to help memory management. 200x2x14 bit thresholds
(in unit of 200-bit entry) and 200 x 13 bit running counters (in
unit of 200-bit entry) are provided. Two additional
per-destination transmit and loss counts (32 bit each, in unit of
packets) are also maintained. If the running count of a destination
is above the relative threshold, new packets are rejected and the loss
count increments. Whenever dropping, the whole packet is dropped.
Otherwise, the transmit count increments. For multicast
connections, cells can also be rejected because the multicast route
word FIFO is full. 4 additional FIFO full counts are needed. If a
packet is dropped, the whole packet is cleaned from the memory
(including the segments of a long packet). The thresholds and
current counts are in unit of 200-bit cache lines.

The minimum threshold (13 bit value plus 1 bit enable
bit) is used to prevent shared memory starvation, i.e., every queue
reserves at least the number of cache lines indicated by the
threshold. The maximum threshold (13 bit value plus 1 bit enable
bit) is used to prevent any single queue consuming the whole shared
memory. These two thresholds cannot be changed unless there are no
packets in the queues.
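The maximum-threshold admission rule can be sketched as below. This is a simplified model under stated assumptions: only the maximum threshold is modelled (the minimum-threshold reservation and the shared free list are left out), counts are in cache lines, and the class, attribute, and method names are hypothetical.

```python
class QueueAccounting:
    """Per-destination running count with whole-packet accept or reject."""

    def __init__(self, max_threshold: int, max_enabled: bool = True):
        self.max_threshold = max_threshold  # in cache lines
        self.max_enabled = max_enabled      # models the 1-bit enable
        self.running_count = 0              # cache lines currently queued
        self.transmit_count = 0             # packets accepted
        self.loss_count = 0                 # packets rejected

    def offer(self, packet_cache_lines: int) -> bool:
        """Try to enqueue a whole packet; never accept part of one."""
        if self.max_enabled and self.running_count > self.max_threshold:
            self.loss_count += 1
            return False
        self.running_count += packet_cache_lines
        self.transmit_count += 1
        return True

    def drain(self, cache_lines: int) -> None:
        """Release cache lines after they are sent toward the separators."""
        self.running_count = max(0, self.running_count - cache_lines)
```

Because the check is made before the packet is added, a queue can overshoot its threshold by at most one packet, after which further packets are rejected until it drains.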

All counters are 32-bit wide. They are reset to zero
automatically after reading. Their values stick to 0xFFFFFFFF if
overflowed. It takes 2^32 x 8ns, roughly 32 seconds, to overflow a
counter in the worst case.
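The counter behavior described above (clear on read, saturate instead of wrapping) can be modelled in a few lines. The class name is hypothetical, and the 8 ns increment interval used for the worst-case estimate is an assumption about the clock period.

```python
class StickyCounter:
    """32-bit counter that saturates at 0xFFFFFFFF and clears when read."""

    MAX = 0xFFFFFFFF

    def __init__(self):
        self.value = 0

    def increment(self, amount: int = 1) -> None:
        # Stick at the maximum rather than wrapping around to zero.
        self.value = min(self.MAX, self.value + amount)

    def read(self) -> int:
        # Reading returns the current value and resets the counter.
        current = self.value
        self.value = 0
        return current


ASSUMED_INCREMENT_PERIOD_S = 8e-9  # assumption: one increment every 8 ns
worst_case_overflow_s = 2 ** 32 * ASSUMED_INCREMENT_PERIOD_S
```

Under the 8 ns assumption the worst-case overflow time evaluates to about 34.4 seconds, the same order as the figure quoted in the text, so software would need to poll each counter more often than that to avoid losing counts.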

The value of any threshold register can be updated on the
fly by a resync cell or a shadow control cell. The content of the
32 bit shadow data register is copied to the location pointed to by
the shadow address register.

The memory controller can enqueue a single OC192 data
stream from the aggregator ASIC and dequeue a single OC192 data
stream to the separator ASIC instead of 4xOC48 streams. At the
ingress side, the ASIC receives 4 continuous cells/packets/cache
lines from the same source channel instead of 4 channels. No
special treatment is needed.

At the egress side, the Queue Drainer reads 4 cache lines
from the shared memory for one destination after a token command is
received for the OC192 port. The RCD can send up to 4 200 bit
cache lines to the separator from the same destination queue. Each
OC-192 port has 4 priorities for all switch configurations.

The separator ASICs receive cell/packet streams from 12
memory controllers, separate them, and send them to up to 48 network
blades through the backplanes. The interfaces between the separator
and the backplane are 250MHz point to point HSTL signals.

Figure 18 shows the Separator ASIC architecture.

• Receive 12 data streams from 12 memory controllers
• Fabric synchronization
• 24 destination (blades and channels) addressing
• Route word separation and aggregation
• 0.2Gum 3V CMOS technology
• 410 I/O pins
• 140 bit 250MHz input; 240 bit 250MHz output (at most 120 of
them switch simultaneously); 30 bit control signals

The separator has twice the number of data output pins as
that of the aggregator ASIC to support 2X speedup. Similar to those
of the striper ASIC, the ASIC supports 40G, 80G, 120G, 160G, 240G,
and 480G switch configurations without backplane change.

The separator ASIC performs the reverse function of the
aggregator ASIC. The ASIC receives a 120 bit 250MHz cell/packet
stream from one of 8 DOUT_ME_fb_se_bu buses of every memory
controller (12 of them). 10 bit blade and channel selection signals
are used to select one of 24 destinations inside each separator for
up to two cells. For example, the DIN_SE buses of separator ASIC #1
are connected as follows:
When a valid cell/packet (channel ID is in the range of
0-23) is received, the packet type field in the route word is
checked first. If it is an ATM cell, no packet length field
follows. The length of the cell payload is 36x12/number of fabrics. If
it is a packet, the packet length bit that immediately follows is used
to indicate how long the packet length is: 0 - 12 bit packet length
(including this bit) and 1 - 24-bit packet length (including this
bit). The entire packet/cell is routed to the destination channel
indicated by the channel ID. An invalid channel ID (bigger than
23) is used to indicate that the cell/packet is invalid.

The ASIC then separates the route word and the payload
onto the route word bus and the data bus of one of 6 blades and 4
destination channels/unstriper ASICs based on the channel ID
signals. One 250MHz 24 bit data bus yields 6Gbps data bandwidth for
each channel. Each route word is 2 bit wide running at 250MHz.

The connectivity between the separator ASICs and the
Unstriper ASICs is symmetric to that between the aggregator ASICs
and the striper ASICs. The only difference is that all data and
route word pins have double width to achieve 2X speedup.

Data received from each destination of each memory
controller is accompanied by a 1-bit valid bit. 24 destination
input FIFOs are used to store the 12 pieces of
cell/packets from 12 memory controllers for the 24 destination blades
and channels in each separator, respectively. When all 12 cell
segments arrive, the complete cell is sent to the relative output
FIFO indicated by the channel ID.
Like the striper ASIC, a 3 bit sequence number counter is
maintained for the backplane synchronization. It increments every
36 250MHz cycles. When a cell is sent to the unstriper ASICs via
the backplane, the current counter is attached into the sequence
number field in the 36 bit route word.

The sequence number counter is reset by the global
resynchronization logic.

The unstriper ASIC takes 6Gbps traffic from up to 12+1
switch fabrics. It then unstripes the cell and sends it to the
egress netmod ASIC at 5Gbps or lower speed.

• Receive 6Gbps route word and data from up to 12+1 fabrics at
250MHz for OC48, or combine 4 chips to support 24 Gbps
routeword and data from up to 12+1 fabrics for OC192c
• Error check data transport throughout the switch, detect
corrupted data and perform data recovery
• Reconstructs cells/packets from the individual switch fabrics.
• Send 64 bit 100MHz data to the egress port ASIC for OC48, 256
bit for OC192c
• Supports both UC and MC connection context for fabric data.

Figure 19 shows the unstriper ASIC Architecture.

The unstriper ASIC receives cells from up to 12+1
fabrics, each running at 250MHz. It uses the following steps to
reconstruct good data.

1. All incoming routewords are compared. If any one routeword
disagrees, that data lane is flagged as being in error. If more
than one routeword disagrees, the data is dropped.

2. All valid input lanes are put through reconstruction logic which
will attempt to build N+1 candidate output data streams for an N
fabric switch. Any data lane which is not valid will invalidate any
data lane which uses that data.

3. All valid reconstruction lanes will check the CRC of the
received data and one passing output is selected.

The striper remaps the separate routeword and data buses
to a combined outgoing routeword + data bus.
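The routeword comparison in the first reconstruction step above can be sketched as a majority check. This is an illustrative model only: routewords are treated as plain integers, one per fabric lane, the agreed value is taken to be the most common one, and the function name is hypothetical.

```python
from collections import Counter


def check_routewords(routewords):
    """Compare per-lane routewords.

    Returns (agreed_value, bad_lanes). agreed_value is None when more than
    one lane disagrees, which models the cell being dropped.
    """
    if not routewords:
        raise ValueError("at least one lane is required")
    majority, _ = Counter(routewords).most_common(1)[0]
    bad_lanes = [lane for lane, word in enumerate(routewords)
                 if word != majority]
    if len(bad_lanes) > 1:
        return None, bad_lanes
    return majority, bad_lanes
```

A single flagged lane is the case the spare fabric is meant to cover, since the parity slice can stand in for one missing data slice during reconstruction.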

The following will detail the steps which happen at power
up from an architectural perspective. Note that when expanding
switch capacity, the additional fabrics must be brought on line
before any new port cards are brought on line.

Fabric Initialization 

1. Port cards (unstripers) are initialized to only look at the
current fabric capacity and ignore other fabric inputs.

2. Fabric is inserted, asserts its board present signal. Stripers
start sending routewords to the new fabrics, though they are
ignored at this point.

3. Board is reset, MP starts to boot the board. Before
proceeding to the next step, the MCP/SCP establish
communication via the enet network.

4. If the board is fabric 0 or the parity fabric, the sync pulse
transmitter is initialized. (Actually the sync pulse transmitter
can be initialized on all fabrics, but it is only connected to
BP signals if it is fabric 0 or the parity fabric.)

5. MP initializes sync registers in the aggregator, memory
controller, and separator, then initializes the registers in
the sync pulse receiver. The sync pulse receiver starts to
look for a valid sync pulse. The last sync setup is the sync
pulse receiver, so that all receivers on the chips are ready
for the sync pulse from the sync pulse receiver. The fabric
chips run chip-chip sync on the next backplane sync pulse. The
MP should check to make sure the fabric has synchronized. If
sync has not been achieved, reset the fabric chips and
re-execute step 4.

6. SCP tells MP the current switch capacity window to use. This
is actually going to correspond to the current switch capacity
(does not count the capacity of the new fabric if switch
capacity is being expanded).

7. MP initializes the backplane transceiver networks with the
current switch capacity (both send and receive) and
initializes all registers except the aggregator input enables. Any
values used for configurable options (which ports are
OC48/OC192, memory thresholds, etc.) need to be communicated
and initialized at this point. Certain registers are
initialized based on the switch board slot, which needs to be
known at this point. From a software perspective, the biggest
register set which must be done is to update the port mask
table in the memory controllers to match the port mask table
from another switch fabric.

8. Aggregator input enables are set for the current switch
capacity. This will start enqueueing traffic on this switch
board. The aggregators will need to see a bus idle followed by
an increment in the transmit sequence number before starting
to actually receive data.

9. SCP sends a queue resync cell. On cell return, fabric queues
are now synchronized. However, no valid data is being enqueued
in the new fabric(s) and the fabric outputs are being ignored.

10. All unstripers must be configured to start utilizing the new
fabric. Since queues have been resynchronized, the fabric
dequeuing should be synchronized and no errors should be seen.
If errors are seen, clear them, return to step 9.

11. After all unstripers have been updated, SCP tells all port
card MCPs to update the stripe amount inside each of the striper
ASICs. The change in striper configuration will start the
switch utilizing the additional capacity.

12. After all stripe amounts are updated and traffic from the
previous stripe amount has drained from the switch, the
switch capacity needs to be updated. The only fixed time bound
way to ensure traffic from the previous stripe amount is
flushed is to execute a queue resync. If not all traffic has
been flushed from the system with the previous stripe amount,
the switch will drop this traffic at the unstripers (since
there is no synchronization of the update at the separators,
the drop cannot be performed there).



Before a port card is brought on line, any necessary
switch fabrics must be brought on line first. As per the switch
standard convention, port card installation happens in order.
1a. The starting state has sufficient switch capacity to support
the new port card. Aggregators are currently configured to ignore
the input from any new board.

1b. Port card is inserted and asserts its board present signal.
Port card sees sync pattern received from the fabrics.

2. The sync pulse receiver is initialized. The port card starts
looking for a valid sync pulse on the backplane.

4. Striper transmitter is set up for the appropriate number of
destination fabrics and the Gbit network control is initialized.
Before the Gbit networks are initialized, the fabrics cannot count
on seeing idle data from the new port card. At this point, the
port card can communicate its type (OC48/OC192) to the fabrics.

5a. Fabrics configure the port card type and enable the input from
the port card.

5b. Striper/unstriper are now initialized, along with the other
chips on the board. Some enable in the inbound data path should be
disabled. The BIB input enable in the striper can be used or some
other board specific input enable.

6. After both 5a and 5b have been completed, the port card can
enable its input side and start sending data to the fabrics. Note
that in general further software configuration will need to be
done after this point (such as setting up inbound lookup entries).
The completion of 5a is necessary to ensure the fabric queues do
not go out of sync.

7. First data from the port card is striped to all fabrics.

1. When a port card is removed from the system, not very much needs
to happen from a hardware perspective. Before the port card goes
away, it transmits a packet abort which will cause any incomplete
packets in the egress side to be dropped. Traffic will be drained
from the memory queues which correspond to the affected output
ports.

2. To remove a port card from the switch logically, software
should disable the striper output bus.

Fabric deactivation is similar to fabric activation in
reverse. The steps include:

1. Switch capacity is being removed. If port cards are present in
the switch which are paired with the fabric capacity which is about
to be removed, those must first be deactivated.

2. Program the remaining stripers in the system to stripe data to
one less stripe amount than the current configuration. This will
stop sending real data to the fabric about to be decommissioned.

3. Send a queue resynch. This will flush out any traffic at the
last stripe amount.

4. Program the unstripers to start ignoring the data from the
fabric which is about to be removed.

5. The fabric can now be physically removed from the system, or
logically removed from the system by disabling its inputs and
outputs.

The reason for the queue resynch step is not because the
switch is out of sync. The unstriper will treat the receipt of
traffic which is striped to more fabrics than physically present in
the switch as an error and increment error counts. The queue
resynch ensures that the error counts on the unstripers will not
increment unnecessarily.

1. Flush out traffic from the port to be converted over to APS.
Initialize anything in the separator as required for the new output
port combination.

2. Write to the APS enable bit using the shadow register in every
memory controller for the output port being affected. The main port
for APS is not affected. Either a higher or lower number port can
be the primary port and the backup port. APS is always enabled on
the backup port.

3. Send either a queue resync cell or a shadow control cell to all
memory controllers.

4. Memory controllers start to dequeue after the next left aligned
cache boundary (if the previous transfer for this port was left
aligned, it will be remembered).



Note that in all this process, the queue number was never switched.
The switch will not support a seamless port swap due to APS
activate/deactivate. (In other words, APS can be turned on port 0,
which will cause port 0 to mirror port 10. However, APS cannot be
turned off on port 10 since it is not on. Traffic is only being
changed for the port where APS is added.)

The following words have reasonably specific meanings in the
vocabulary of the switch. Many are mentioned elsewhere, but this is
an attempt to bring them together in one place with definitions.



TABLE 22:

Word              Meaning

APS               Automatic Protection Switching. A SONET/SDH standard
                  for implementing redundancy on physical links. For
                  the switch, APS is also used to recover from any
                  detected port card failures.

Backplane synch   A generic term referring either to the general
                  process the switch boards use to account for varying
                  transport delays between boards and clock drift, or
                  to the logic which implements the TX/RX
                  functionality required for the switch ASICs to
                  account for varying transport delays and clock
                  drifts.

BIB               The switch input bus. The bus which is used to pass
                  data to the striper(s). See also BOB.

Blade             Another term used for a port card. References to
                  blades should have been eliminated from this
                  document, but some may persist.

BOB               The switch output bus. The output bus from the
                  striper which connects to the egress memory
                  controller. See also BIB.

Egress Routeword  This is the routeword which is supplied to the chip
                  after the unstriper. From an internal chipset
                  perspective, the egress routeword is treated as
                  data. See also fabric routeword.

Fabric Routeword  Routeword used by the fabric to determine the output
                  queue. This routeword is not passed outside the
                  unstriper. A significant portion of this routeword
                  is blown away in the fabrics.

Freeze            Having logic maintain its values during lock-down
                  cycles.

Lock-down         Period of time where the fabric effectively stops
                  performing any work to compensate for clock drift.
                  If the backplane synchronization logic determines
                  that a fabric is 8 clock cycles fast, the fabric
                  will lock down for 8 clocks.

Queue Resynch     A queue resynch is a series of steps executed to
                  ensure that the logical state of all fabric queues
                  for all ports is identical at one logical point in
                  time. Queue resynch is not tied to backplane resynch
                  (including lock-down) in any fashion, except that a
                  lock-down can occur during a queue resynch.

SIB               Striped input bus. A largely obsolete term used to
                  describe the output bus from the striper and input
                  bus to the aggregator.

SOB               One of two meanings. The first is striped output
                  bus, which is the output bus of the fabric and the
                  input bus of the agg. See also SIB. The second
                  meaning is a generic term used to describe engineers
                  who left Marconi to form/work for a start-up after
                  starting the switch design.

Sync              Depends heavily on context. Related terms are queue
                  resynch, lock-down, freeze, and backplane sync.

Wacking           The implicit bit steering which occurs in the OC192
                  ingress stage since data is bit interleaved among
                  stripers. This bit steering is reversed by the
                  aggregators.



The Aggregator Receive Synchronizer's function is to maintain
logical cell/packet ordering across all fabrics. Cells/packets
arriving at more than one fabric from different port cards need to
be processed in the same logical order across all fabrics. If
cell/packet logical ordering is not maintained, then the stripes of
a particular cell/packet coming out of the fabrics will not match
up and cannot be reassembled by the Unstriper.

Logical cell/packet ordering needs to be maintained across the
following conditions:

- Transport delay variances between one source and multiple
  destinations
- Clock drift across transmitters and receivers
- Insertion and removal of port cards and fabrics
- Port card errors such as no sync, no lock-downs, too fast/too
  slow, routeword parity errors
- Gigabit transceiver errors such as loss of lock, data errors
- Non-synchronized updates to Gigabit network
- OC192c data streams (aggregating 4 channels to make up one
  OC192c stream)



The switch uses a system of transmit and receive counters. The
counters allow all components in the system to logically align
themselves. The Master Sequence Generator implements these two
counters that will count continuously from '0' to '3' and will
increment every x 125 MHz clock cycles, where x is the counter tick
length as programmed by software. x is currently calculated to be
250 cycles. This is based on analysis done in the Backplane
Synchronization ADS. The relationship between the transmit and
receive counters can be seen in Figure 2-9-. One counter will be
used by the transmit synchronizers in the Striper and Separator
ASICs and the other counter will be used in the receive
synchronizers in the Aggregator and Unstriper ASICs. The receive
counter will be a delayed version of the transmit counter. The
amount of delay is programmed by software in the Sync Pulse Receive
Delay register. This register determines the number of clock cycles
that the receive counter waits before incrementing its own counter
relative to the transmit counter. This register should always be
non-zero since the transmitter will have no delay and the receiver
needs to be delayed with respect to the transmitter. The Sync Pulse
Receive Delay has been estimated to be 150 cycles. The delay is
approximately equal to the worst case transport delay between
transmitter and receiver plus worst case transport delay variance
of the sync pulse. The delay also takes into account worst case
fast and slow transmitters and receivers.
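
The relationship between the two counters can be sketched in a few lines. The tick length (250 cycles) and receive delay (150 cycles) are the figures quoted above; the number of counter states and the function names are assumptions made only for this illustration.

```python
TICK_LENGTH = 250   # cycles per counter tick (value quoted in the text)
RX_DELAY = 150      # Sync Pulse Receive Delay estimate quoted in the text
COUNT_STATES = 4    # assumption for illustration: a small wrap-around count


def tx_count(cycle):
    """Transmit sequence count at a given 125 MHz clock cycle."""
    return (cycle // TICK_LENGTH) % COUNT_STATES


def rx_count(cycle):
    """Receive sequence count: the transmit count delayed by RX_DELAY cycles."""
    return ((cycle - RX_DELAY) // TICK_LENGTH) % COUNT_STATES
```

With these values the receive counter makes each transition exactly 150 cycles after the transmit counter makes the same transition, which is the sense in which the receive counter is a delayed version of the transmit counter.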



The Sync Pulse Period is defined as the number of cycles between
sync pulses. It is extended slightly by about 10 cycles in order
for it to appear late in the '0' window of each ASIC's sequence
count. This is done to ensure that every ASIC will appear to be
running too fast even if they are actually running slow relative to
the clock that generated the sync pulse. If this was not done, the
sync pulse could appear in the '3' window and the ASIC would
consider itself to be slow. There would be no way for it to catch
up. Each transmitter and receiver will calculate the difference
between when the sync pulse arrives and when its own counter
transitions from '3' to '0'. This difference is the number of
cycles that it is fast and is referred to as the lock-down amount
(z in figure). Once a transmitter determines it should lock down
for z cycles, it will finish sending valid data during its '0'
window and then lock down z cycles. During the lock-down period, no
valid or idle data is sent. Instead, a special lock-down K
character is transmitted which will be recognized by the receiver.
The receiver will not write the lock-down characters into its input
FIFOs. This will ensure that the input FIFOs can't overflow. Since
the sequence counter does not advance for the amount of lock-down,
it is effectively resetting itself to the sync pulse. It is
equivalent to having the sync pulse appear at the start of the '0'
count window since the transition to a count of '1' occurs
precisely one tick length after the sync pulse arrives. When the
next sync pulse arrives, if clock frequencies are constant, then
the sync pulse should appear in the '0' count window and the
calculated lock-down amount will be the same as the previous
calculation. This allows the system to always expect the sync pulse
arrival in the '0' count window even if the clocks generating the
sequence counter are too fast or too slow.
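
The lock-down calculation described above reduces to a subtraction, and the receiver side reduces to a filter. The following is a minimal sketch under that reading; the function names and the string stand-ins for data and lock-down characters are hypothetical.

```python
def lock_down_amount(wrap_cycle, sync_pulse_cycle):
    """Cycles by which the local counter is ahead of the sync pulse.

    wrap_cycle is the cycle at which the local sequence counter wrapped;
    sync_pulse_cycle is the cycle at which the sync pulse arrived.
    """
    z = sync_pulse_cycle - wrap_cycle
    if z < 0:
        # The text arranges for this not to happen by extending the
        # sync pulse period so every ASIC appears fast.
        raise ValueError("sync pulse arrived before the counter wrapped")
    return z


def transmit_window(valid_words, z):
    """Send the window's valid data, then z lock-down characters."""
    return list(valid_words) + ["LOCK_DOWN"] * z


def receive_filter(stream):
    """The receiver never writes lock-down characters into its input FIFO."""
    return [w for w in stream if w != "LOCK_DOWN"]
```

Because the lock-down characters are discarded before the FIFO, a fast transmitter adds no extra words to the receiver's FIFO during the lock-down period.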



The Receive Synchronizer block will use the sequence counter to
determine when to accept data from input byte sync FIFOs. Once a
sync character is read, pops from the FIFOs will only occur once
the sequence counter transitions from "0" to "1" and immediately
following an arrival of a sync pulse. The read decision is only
made once every sync pulse arrival and only at the "0" to "1"
transition of the receive sequence counter. The sequence counter is
also used during fabric resync in order to communicate a fabric
resync to all channels in all aggregators during a sequence count
transition. Fabric resync cells will be transmitted at the
beginning of a sequence tick window and are prefixed by a special
character indicating a resync cell. The receive synchronizers in
the Aggregators will resynchronize all data going to the memory
controllers on the next sequence count transition once the resync
character has been received.

A block diagram of the Receive Synchronizer can be seen in Figure
21. The Receive Synchronizer consists of 24 Byte sync FIFOs, a
Crossbar and 6 Bus Synchronizers. There is one byte sync FIFO per
gigabit receiver. Each byte sync FIFO will accept data from each
gigabit receiver independent of the mode of the switch. The byte
sync FIFO is about 256 words deep. This depth is based on a
derivation found in the Backplane Synchronizer ADS. The Crossbar
will handle the assignment of the appropriate input byte lanes to
the correct channels. Each Bus Synchronizer will consist of four
Channel FIFOs and one Bus Controller. The Bus Controller can handle
4 separate OC48 channels or one OC192c stream. The channel FIFO is
about 18 words deep. The depth is based on the number of words
needed to read a 36-bit routeword. The whole routeword is read and
then presented to the rest of the Aggregator in one cycle since it
needs to be stored before the data of the packet as it is
constructed and sent to the memory controller.

Multiple gigabit receivers make up a 24-bit data bus and 2-bit
routeword bus for one channel of an Aggregator. Each gigabit
receiver can handle up to 8 bits. Due to varying transport delays
that can exist between receivers, bytes from different receivers
that belong to the same word can be skewed from each other. For
example, the 24-bit data bus and 2-bit routeword bus for one
channel of an aggregator will have 4 receivers that make up the
bus. The synchronization logic will align all 4 bytes for the
26-bit bus and will pass this byte aligned word to the rest of the
Aggregator. In order to align the bytes, the Striper will need to
send a special alignment byte to each receiver. A special K
character can be utilized from the gigabit transceivers. The K
character will be encoded in the data bits on the Gigabit
transmitter and will be detected on the Gigabit receiver.
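
The byte alignment described above can be illustrated with a small deskew routine. It is a sketch only: each lane is modeled as a Python list, the string "K" stands in for the special alignment K character, and the function name is hypothetical.

```python
ALIGN = "K"  # stand-in for the special alignment K character


def align_lanes(lanes):
    """Realign skewed byte lanes on their alignment character.

    Each lane is discarded up to and including its alignment character,
    then the lanes are read in lock step so that bytes belonging to the
    same word come out together.
    """
    aligned = []
    for lane in lanes:
        start = lane.index(ALIGN) + 1  # ValueError if no alignment byte seen
        aligned.append(lane[start:])
    return list(zip(*aligned))
```

For four lanes that received the alignment character at different offsets, the routine returns whole words made of one byte per lane, regardless of how far each lane was skewed.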

The receive synchronizer in the Aggregator will consist of 24 FIFOs
where there is one FIFO per Gigabit Receiver. These FIFOs will
handle both byte alignment and the backplane synchronization. It is
assumed that the Gigabit Receivers will be able to distinguish
between valid, idle, sync and lock-down cycles and will indicate
these various cycles to the Aggregator by using 3 control signals.

On startup, the FIFOs will be empty and each Write State Machine
(WSM) will wait until a sync character is seen on its input. From
this point on, every cycle will be pushed except for lock-down
cycles from the fabric. When the fabric is locking down, the
Stripers will send special lock-down characters. This is done to
avoid overflowing the sync FIFOs in case the write side clock is
faster than the read side clock. While particular types of words
are being pushed, the word type will also be written to the FIFO so
it can be distinguished on the read side.

The WSM is also looking for a special fabric resync cell K
character that will indicate that a fabric queue resync cell will
immediately follow. If a resync cell is detected, a resync signal
is passed along to the Bus Controller. The Bus Controller will then
tell other Aggregators on the fabric to resync their queues at the
next transition of the sequence counter. Fabric queue resync is
described in more detail later.
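
The write-side behaviour described above can be sketched as a small state machine. This is an illustrative model rather than the ASIC logic: the class and enum names are hypothetical, and resync detection is left out to keep the sketch short.

```python
from collections import deque
from enum import Enum


class Cycle(Enum):
    VALID = "valid"
    IDLE = "idle"
    SYNC = "sync"
    LOCK_DOWN = "lock_down"


class WriteStateMachine:
    """Write side of one byte sync FIFO."""

    def __init__(self):
        self.fifo = deque()
        self.synced = False

    def write(self, cycle, payload=None):
        """Return True if the cycle was pushed into the FIFO."""
        if not self.synced:
            # On startup nothing is written until a sync character is seen.
            if cycle is not Cycle.SYNC:
                return False
            self.synced = True
        if cycle is Cycle.LOCK_DOWN:
            # Lock-down characters are never pushed, so the FIFO cannot
            # overflow when the write clock is faster than the read clock.
            return False
        # The word type is stored with the word for the read side.
        self.fifo.append((cycle, payload))
        return True
```

Storing the word type next to each word is what later lets the read side check that the head of the FIFO is a sync character.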

Gigabit receivers are not dedicated to particular input channels,
but instead shared between various channels. Each byte sync FIFO
works independently of the switch mode and each input lane needs to
be steered to the correct channel FIFO. For instance in 40 mode, 26
bits of data and routeword are required for Bus 1, channel A and
therefore 4 byte lanes are required to be steered to each channel
of Bus 1. In 80/120 mode, only 8 bits of data and 2 bits of
routeword are required and therefore two bytes will suffice. In
4 00 mode, only 4 bits are required per channel and one byte lane
will suffice. As switch capacity increases, fewer and fewer byte
lanes will be required for a particular channel. For all switch
modes, the routeword bits for a particular channel will always come
from the same byte lane. As the byte lanes get reduced from 4 to 1
byte lanes, there will always be one common byte lane used to carry
the routeword data lines. The crossbar will take in 24 lanes
consisting of 8 bits of data and 3 bits of control along with other
control signals to communicate with the Bus Control logic. It will
then forward all these signals to the appropriate channels. The
Crossbar will also accept control data from the Bus Controller and
forward signals such as read requests and FIFO flush signals to the
appropriate input byte sync FIFOs. Each crossbar mapping between
input byte lanes and channels is bi-directional.

The Bus Controller consists of three state machines. The state
machines control the read side of the byte sync FIFOs, the write
side of the channel FIFOs and the read side of the Channel FIFOs.
On the read side of the Byte FIFOs, pops will not commence until a
sync pulse has arrived and the receive sequence counter has
transitioned from "0" to "1". A signal will be provided from the
sequence generator block that indicates a "0" to "1" transition at
precisely this moment (sync_event). At this time, the Bus
Controller issues a read to the Crossbar for the particular
channel. The Crossbar then forwards the read signal to the
appropriate byte sync FIFOs based on the mode of the switch. The
Crossbar then forwards all data and control from these byte sync
FIFOs back to the Bus Controller for this channel. The Bus
Controller checks the data types to make sure that the first word
in each of the appropriate byte sync FIFOs is a sync character. If
the first word of any of the appropriate byte lanes for this
channel is not a sync character, then a sync error will be flagged,
appropriate byte sync FIFOs will be flushed and the synchronization
process will be re-initiated. If the first word is a sync
character, then pops will continue. In OC48 mode, this process will
be performed independently for each channel. OC192c support is
discussed later on.

Once data starts being read from byte sync FIFOs, the Bus
Controller will ignore data until it finds the first idle word.
Once an idle word has been found, it can now start looking for the
SOP indication in the routeword when the next non-idle word is
read. The rest of the routeword is processed and made available to
the rest of the Aggregator. If the stop bit in the routeword
indicates that the packet is continuing, then data will be
continuously made available to the Aggregator until a stop
indication is read. Note that even though a SOP is seen, it does
not mean that this segment is the first segment of a packet. It can
be any segment of a packet. Even though the segment may not be the
first one of a packet, it is allowed to go through the switch and
will be dropped later on.

When a sync character is read, a counter is initialized. The
counter counts each read from the byte sync FIFOs. The Bus
Controller will expect to see a sync character every sync pulse
period (about 22,000 cycles). If a sync character is read too early
or too late, then a sync error is flagged and data is dropped at
the precise logical cycle where a sync character is expected. A
packet that is being processed at the theoretical logical cycle for
sync will be terminated and inputs will be disabled until
re-enabled by S/W. For example, if after the first sync character
the next sync character occurs at cycle 19,000, then a sync error
is flagged. Data is not dropped until 22,000 reads have been
performed. Also, if after the first sync character the next sync
character is not received at all after 22,000 cycles, then a sync
error is flagged and data is dropped at this precise logical cycle.
If a sync character is received precisely 22,000 cycles after the
last one, then reads from the byte sync FIFOs are stopped until the
receive sequence counter transitions from '0' to '1'. Waiting for
the '0' to '1' transition will ensure that all fabrics are
receiving the same stripe of a packet on the same logical cycle.
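
The sync check in this paragraph can be summarized as a decision function. This sketch follows the examples given in this paragraph only (an early sync at 19,000 reads still drops data at 22,000 reads); the function name and status strings are hypothetical, and the period is the approximate figure quoted in the text.

```python
SYNC_PERIOD = 22_000  # approximate sync pulse period in reads, per the text


def check_sync(reads_at_next_sync):
    """Classify the arrival of the next sync character.

    reads_at_next_sync is the number of reads after the previous sync
    character at which the next one was read, or None if none was read.
    Returns (status, drop_cycle); drop_cycle is None when nothing is dropped.
    """
    if reads_at_next_sync == SYNC_PERIOD:
        # On time: reads pause until the '0' to '1' counter transition.
        return ("ok", None)
    if reads_at_next_sync is not None and reads_at_next_sync < SYNC_PERIOD:
        # Early: the error is flagged, but data is dropped at the expected cycle.
        return ("sync_error_early", SYNC_PERIOD)
    # Late or absent: the decision is made at the expected cycle.
    return ("sync_error_missing", SYNC_PERIOD)
```

Dropping at the expected cycle rather than at the observed one is what lets every fabric make the same decision at the same logical cycle.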

For OC192c, 4 input channels need to be concatenated into one
OC192c stream. In this mode, the Bus Controller will control all 4
channel FIFOs and the appropriate byte sync FIFOs. Data type
checking will be performed across 4 times as many byte lanes as in
the OC48 case. When it is time to read byte sync FIFOs, the Bus
Controller will control 4 read control lines to the Crossbar. The
Crossbar will initiate reads across all appropriate byte sync FIFOs
that are required for OC192c and will present data back to the Bus
Controller. The Bus Controller will check data types and will look
for SOP indications. The SOP indication and stop bits will only be
found in the Routeword for channel A. The Bus Controller will write
all 4 channel FIFOs at the same time when writing data and will
present the complete OC192c Routeword in one cycle to the rest of
the Aggregator. The functions of the Bus Controller will be
identical for OC48 and OC192c except that all 4 channel FIFOs will
be controlled when in OC192c mode.

Special cases can be broken down into the following categories:

1. Port card insertion
2. Port card removal
3. Port card errors including:
   A. No sync character
   B. Port card not locking down
   C. Routeword parity errors
   D. Garbage data
   E. Port card sending data too fast or too slow
4. Fabric Queue resync
5. Non-synchronized updates to Gigabit network

When a port card is inserted, the port card present signal will be
asserted and sent to each fabric. Not until S/W enables the
particular inputs and the Aggregator sees the port card present
signal will the Aggregator be ready to accept data from the new
port card. Once enabled, the Aggregator will go through the process
of looking for sync characters on individual byte lanes associated
with the new port card. It is assumed that the port card will not
send any data until it has been configured, and that this happens
only after the fabrics have been initialized. Once the port cards
are enabled, they will start sending sync characters periodically
at every global sync pulse arrival. It is important that all the
appropriate fabrics see the sync character from the particular port
card since some fabrics will be initialized later than others.
After sync characters have been received, all data will be written
on each cycle excluding lock-down characters.

When a port card is about to be removed, the enable switch on the
port card will be turned off. This will signal the port card to
finish sending valid packets and then send idles. The port card
will send a packet abort K character to indicate that no more valid
packets will be sent immediately following the last valid packet.
It is assumed that when the port card is actually removed, it will
have already sent the packet abort K character. This is critical
for the fabrics to keep their queues in sync. It is important that
each Aggregator on each fabric that handles the particular port
card stops forwarding data to the memory controllers at precisely
the same logical cycle. The WSM will stop writing data into the
byte sync FIFOs once the packet abort character is seen. The Bus
Controller will terminate the packet once the packet abort
character is read out of the byte sync FIFOs.

Case A: No sync/early sync/late sync from port card.
Solution: The Synchronizer will look for a sync at precisely the
same logical cycle each time. This will occur every sync pulse
period, which is approximately 22,000 125 MHz cycles. If the sync
character is not present at the head of the byte sync FIFOs when
22,000 cycles have been read since the last sync character, a sync
error will be flagged and data will be dropped at the cycle where
the sync character should have been. All fabrics need to drop data
at precisely the same logical cycle for this particular input lane.
Inputs for this particular channel will be turned off and the byte
sync FIFOs used for this channel will be flushed. S/W will turn off
the offending Striper. Inputs will be ignored until S/W enables
these inputs again. If a sync character arrives too early, then
data should be dropped at precisely the cycle where the early sync
was read. Other Aggregators will make the same drop decision if
this error is common to all fabrics. If the sync character arrives
too late or not at all, then the drop decision will be made where
the sync character was expected. The sync character is expected to
arrive every 22,000 cycles after the last sync.

Case B: Port card not locking down.
Solution: If the port card does not lock down, it will then send
more than the ideal number of valid and idle cycles between sync
characters. This will be caught by the same logic that checks for
sync characters in the correct logical cycles. Data will be dropped
the same way as in the case where no sync came from the port card.



Case C: Routeword parity errors.
Solution: If a parity error is detected for a particular routeword,
the packet will be terminated at the bad segment and a parity error
will be flagged. Data will be dropped after this terminated segment
is forwarded to the rest of the Aggregator and FIFOs for this
particular channel will be flushed. Inputs will be disabled until
re-enabled by S/W.

Case D: Garbage data from port card while all fabrics already in
sync.
Solution: If the data is unrecognizable by the gigabit receivers,
errors will be formed and provided to the Aggregator by the gigabit
receivers. At the point of error, data being written into byte sync
FIFOs will be flagged to be in error. If the Bus Controller sees
that the particular byte lane in error is not used for the
Routeword bits, then the error will be flagged but the data will be
passed on to downstream logic. This is considered to be a soft
failure since queues will still be able to stay in sync. If the Bus
Controller sees that the particular byte lane in error is used for
the Routeword bits, then the packet will be terminated and then
dropped once the erred word is read from the byte sync FIFO. The
input will be disabled, a gigabit receiver error will be flagged to
S/W and the byte sync and channel FIFOs associated with this
channel will be flushed. This is considered to be a hard failure.
If the failure occurs only for one fabric, then other fabrics can
still be used to reassemble the packets. S/W will have to queue
resync the bad fabric. If this error occurs across multiple
fabrics, not much can be done to prevent fabric queues from
becoming corrupted. S/W will then have to queue resync all fabrics.



Case E: Port card sending data too fast or too slow. It is possible
that the port card is sending the correct number of valid cycles
between sync characters but is not locking down enough or locking
down too much during each lock-down period. Byte sync FIFOs can
eventually overflow or underflow respectively. If more than one
fabric has FIFOs that overflow or underflow and data is dropped at
different logical cycles for the same source, then fabric queues
can become out of sync.
Solution: This is considered a hard failure since it should not
occur if the hardware is working correctly. The only way to
possibly prevent this is to flag an error if the FIFOs reach an
almost full or almost empty threshold. This is a warning sign that
something is wrong. S/W will then turn off the offending port card.
Data will continue to be written to and read from the byte sync
FIFOs as if nothing is wrong. If the port card can be turned off
and idles be sent before byte sync FIFOs overflow, then there will
be no dropped data and fabric queues will stay in sync. If FIFOs
overflow or underflow for a particular channel, then a FIFO
overflow/underflow error will be flagged. The packet being
processed by the synchronizer at the time of error will be
terminated. All data will be dropped from this point on. Inputs for
this channel will be disabled until re-enabled by S/W. FIFOs for
this channel will be flushed.

Fabric queue resync is performed in order to resynchronize memory
controller queues. It is important that all fabrics are processing
the stripe of the same cell or packet at precisely the same logical
cycle and that all fabrics are acting together as one logical
fabric. Fabric queue resync starts at the Stripers. The Striper
will receive a queue resync cell from the control port. The Striper
will decode the queue resync cell and will back up traffic until
the next sequence counter tick is reached. At this point, it will
send a fabric queue resync K character immediately followed by the
queue resync cell. At the fabric, the WSM in the receive
synchronizer will receive the queue resync K character and notify
the Bus Controller in the receive synchronizer that a queue resync
cell is in the input FIFO and that the queue resync event should
occur at the next transition of the receive sequence counter. The
Bus Controller will then indicate to other Aggregators on the
fabric that a resync cell event will take place at the next
transition of the sequence counter. The indication is asserted
about 10 cycles before the receive sequence counter transitions.
This is done to allow enough time for other Aggregators to see this
assertion before their respective receive sequence counters
transition also. Once the sequence count transition occurs, the
Aggregators will signal to the memory controllers that a queue
resync event has occurred and that this event delimits old and new
data. All data sent before the sync event is considered old data
and all data sent after the sync event is considered new data. The
memory controllers synchronize their buffers accordingly. The
resync cell is eventually sent through the switch as a regular cell
and returned to the control port.

There can be times when the gigabit network is changing its
operating mode and the switch is changing from a 40/80 to an 80/120
mode, for example. There is no guarantee that Gigabit Receivers
will be driven by Gigabit Transmitters during this time period.
Aggregators that expect good data from certain Gigabit Receivers
may not get good data. If the switch is increasing its mode, then a
previously unused FIFO will now be used. If this FIFO has garbage
data on its inputs, then syncs will not be received and this FIFO
will not be synced until the gigabit network is stable. Once the
Gigabit network is stable, idles and sync characters will be
transmitted by the port cards and the FIFOs will have enough time
to sync up. If the switch is decreasing its mode, then previously
used FIFOs will now be unused. The Aggregator will know the new
switch capacity and will eventually ignore these channel FIFOs.

The Unstriper needs to provide back-pressure to the Separators when
internal FIFOs in the Unstriper become nearly full. Each Separator
will expect 24 separate back-pressure signals coming from all the
port card channels it is connected to. The back-pressure signal is
considered to be asynchronous to all ASICs. It is required that all
relevant Separators receive back-pressure from a particular channel
in the Unstriper at precisely the same logical cycle. This is done
by having the Unstripers assert the back-pressure signal when their
receive sequence counter transitions. It is assumed that the
Unstriper's receive sequence counter is a delayed version of the
Striper's transmit sequence counter. Since the tick length is 250
cycles and the receive counter is delayed by 150 cycles relative to
the transmit counter, there exist 100 cycles of margin to transport
the back-pressure signal from the Unstriper to the Separator. The
Separator needs about 10 cycles before the transition of its
sequence counter to sample the back-pressure signal. This will give
the Separator enough time to provide back-pressure to the memory
controller before the counter transitions. This places a maximum
requirement on the propagation delay of the back-pressure signal.
The following requirements hold true:

Back-pressure propagation delay < counter tick length - receive
sync pulse delay - setup time of Separator's sample point

Back-pressure propagation delay < 250 - 150 - 10

Back-pressure propagation delay < 90 cycles @ 125 MHz or 720 ns

Assuming worst case conditions, the expected worst case
propagation delay would be:

Back-pressure propagation delay = (Unstriper to Striper delay) +
(Striper to Aggregator delay) + (Aggregator to Separator delay)

Back-pressure propagation delay = 5 cycles (chip and board delay)
+ (5 + 62 cycles) (chip and port card to fabric delay of 500 ns) +
5 cycles (chip and board delay)

Back-pressure propagation delay = 77 cycles < 90 cycles

As can be seen from this estimate, the maximum
back-pressure propagation delay requirement is met.
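
The budget above can be checked with a few lines of arithmetic.
The following Python sketch is illustrative only: the constant and
function names are assumptions introduced here, and the 62-cycle
figure is obtained by truncating 500 ns at 8 ns per cycle, which is
how the estimate in the text appears to arrive at it.

```python
# Timing budget for the back-pressure signal, in clock cycles.
# All names are hypothetical; the numbers come from the text.

CLOCK_MHZ = 125
CYCLE_NS = 1000 // CLOCK_MHZ      # 8 ns per cycle at 125 MHz

TICK_LENGTH = 250                 # counter tick length
RX_SYNC_DELAY = 150               # receive sync pulse delay
SEPARATOR_SETUP = 10              # setup time before the sample point


def max_allowed_delay():
    """Upper bound on the back-pressure propagation delay."""
    return TICK_LENGTH - RX_SYNC_DELAY - SEPARATOR_SETUP


def worst_case_delay():
    """Sum of the three hops under worst case conditions."""
    unstriper_to_striper = 5                     # chip and board delay
    striper_to_aggregator = 5 + 500 // CYCLE_NS  # 5 + 62 (500 ns link)
    aggregator_to_separator = 5                  # chip and board delay
    return (unstriper_to_striper
            + striper_to_aggregator
            + aggregator_to_separator)


if __name__ == "__main__":
    limit = max_allowed_delay()
    print("limit:", limit, "cycles =", limit * CYCLE_NS, "ns")
    print("worst case:", worst_case_delay(), "cycles")
```

With the figures from the text, the limit evaluates to 90 cycles
(720 ns) and the worst case to 77 cycles, leaving 13 cycles of
slack.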

Assuming all the relevant Separators receive the
back-pressure signal before the transition to the next sequence
count, then it can be synchronized to the next transition of the
transmit sequence counter. This will allow all relevant Separators
to stop sending valid data at precisely the same logical cycle for
one complete counter tick interval. This is true since it is
assumed that when the transmit sequence counter transitions, the
data that the Separators are sending are companion fragments of the
same packet. If back pressure is sampled again before the next
counter transition, then data will be stopped for another counter
tick interval. This mechanism implies that back pressure can only
be generated on a counter tick length granularity.
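
The tick granularity described above can be illustrated with a
small model. The Python sketch below is not taken from the
specification: the function names are hypothetical, and it assumes
that the Separator samples the back-pressure level 10 cycles before
each counter transition and that nothing has been sampled before
the first tick.

```python
# Model of back pressure applied at counter-tick granularity.
# Hypothetical names; tick length and sample lead follow the text.

TICK_LENGTH = 250   # cycles per counter tick
SAMPLE_LEAD = 10    # cycles before a transition at which to sample


def level_at(bp_changes, cycle, initial=False):
    """Back-pressure level seen at `cycle`.

    bp_changes is a list of (cycle, level) pairs sorted by cycle.
    """
    level = initial
    for change_cycle, new_level in bp_changes:
        if change_cycle > cycle:
            break
        level = new_level
    return level


def stalled_ticks(bp_changes, num_ticks):
    """For each tick, report whether valid data is held back.

    Tick k is stalled when back pressure was asserted at the sample
    point just before the transition into tick k.
    """
    stalled = [False]  # assumption: tick 0 has no prior sample
    for k in range(1, num_ticks):
        sample_cycle = k * TICK_LENGTH - SAMPLE_LEAD
        stalled.append(level_at(bp_changes, sample_cycle))
    return stalled


if __name__ == "__main__":
    # Asserted at cycle 100, released at cycle 600.
    print(stalled_ticks([(100, True), (600, False)], 4))
```

In this model a signal asserted from cycle 100 to cycle 600 is seen
at the sample points for ticks 1 and 2, so data is held back for
exactly two whole tick intervals, never for a fraction of one.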

Since there is no direct path from Unstriper to
Separator, the back-pressure signals need to be re-routed from the
Unstriper, to the Striper, to the Aggregator and finally to the
Separator. In order to do this, each Unstriper needs to send the
back-pressure signal to the corresponding Striper on that port
card. The Striper will then forward the back-pressure signal
through the backplane gigabit transceivers onto the Aggregator.
The Aggregator will forward up to 24 separate back-pressure signals
to one Separator corresponding to 6 buses with 4 channels per bus.
The back-pressure signal will always be bit 8 of the gigabit
transceivers. The receive synchronizer block in the Aggregator
will forward the correct back-pressure signal for the appropriate
bus and channel to the Separator. Since the gigabit receivers are
not dedicated to any particular bus and channel, the synchronizer
needs to select the correct gigabit receiver based on the switch
configuration just like it does for regular data. Once this is
done, bit 8 of the gigabit receiver is forwarded on as the
back-pressure signal. Note that bit 8 is also used for receiving k
characters and can change when sending a k character. In order to
avoid mistakenly interpreting bit 8 of a k character as a valid
back-pressure signal, the synchronizer will only sample the
back-pressure bit when valid data is received from the gigabit
receiver. In the case where a k character is received, the
synchronizer will hold the back-pressure signal at its current
value. There is still a case where the Striper can be sending
back-to-back idle characters since there is nothing to send. If
the Striper needs to change the value of the back-pressure signal
in this case, then it will send one of two k characters that change
the back-pressure value. The two k characters that will be used
are a set and clear of the back-pressure signal. If the
synchronizer receives a back-pressure set or clear character, it
will set or clear the back-pressure signal respectively. If any
other k character is received, the current back-pressure signal is
retained. If valid data is received, bit 8 of the appropriate
gigabit receiver is sampled as the back-pressure signal.
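
The sampling rules in the preceding paragraph amount to a small
state machine. The Python sketch below is one possible reading of
those rules, not the synchronizer's actual logic: the class, the
field names and the k character labels are hypothetical, the way a
received word is flagged as a k character is an assumption, and the
position of the back-pressure bit is passed in as a parameter
rather than fixed.

```python
# Sketch of the receive synchronizer's back-pressure handling.
# All names are hypothetical and only illustrate the stated rules.

from dataclasses import dataclass

K_BP_SET = "K_BP_SET"      # k character that sets back pressure
K_BP_CLEAR = "K_BP_CLEAR"  # k character that clears back pressure


@dataclass
class Word:
    is_k: bool        # True when a k character was received
    value: int = 0    # received data bits (used when is_k is False)
    k_code: str = ""  # which k character (used when is_k is True)


class BackPressureSync:
    def __init__(self, bp_bit):
        self.bp_bit = bp_bit  # bit index carrying back pressure
        self.bp = False       # current back-pressure output

    def receive(self, word):
        """Update and return the back-pressure signal for one word."""
        if word.is_k:
            if word.k_code == K_BP_SET:
                self.bp = True
            elif word.k_code == K_BP_CLEAR:
                self.bp = False
            # Any other k character: hold the current value.
        else:
            # Valid data: sample the back-pressure bit.
            self.bp = bool((word.value >> self.bp_bit) & 1)
        return self.bp
```

Holding the value on ordinary k characters means that a run of idle
characters cannot disturb the signal, while the dedicated set and
clear characters still let the Striper change it when there is no
data to carry the bit.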

Although the invention has been described in detail in 
the foregoing embodiments for the purpose of illustration, it is to 
be understood that such detail is solely for that purpose and that 
variations can be made therein by those skilled in the art without 
departing from the spirit and scope of the invention except as it
may be described by the following claims. 



WHAT IS CLAIMED IS: 



1. A switch of a network for switching packets
comprising:

a plurality of fabrics which switch portions of packets; 

and 

a port card connected to the fabrics and the network for 
receiving packets from and sending packets to the network, the port 
card having a mechanism for tolerating whether any one of the 
plurality of fabrics has a failure and still sending correct 
packets to the network. 

2. A switch as described in Claim 1 wherein the 
plurality of fabrics includes n fabrics which receive from and send 
to the port card portions of packets, where n is greater than or 
equal to 2 and is an integer, where one of the fabrics is a parity 
fabric which sends to and receives from the port card parity data 
regarding the packets. 

3. A switch as described in Claim 2 wherein the 
tolerating mechanism has a striper which sends portions of packets 
as stripes to the n fabrics to which they correspond, and which 
calculates a checksum of the packet and adds it to the packet 
before it is striped. 

4. A switch as described in Claim 3 wherein the 
tolerating mechanism has an unstriper which receives the stripes 
and parity data from the fabrics, calculates the parity data from 
the stripes received, and compares the parity data received with
the parity data calculated to determine if one of the fabrics has 
failed. 

5. A switch as described in Claim 4 wherein the 
unstriper calculates the checksum for each fabric, replaces the 
data from each fabric in turn, and compares the calculated checksum 
for each fabric to the checksum calculated for each fabric received 
with the packet calculated before the packet is striped, if the 
unstriper has determined one of the fabrics has failed, and 
recovers the stripe from the fabric that has failed from the other 
stripes.

6. A switch as described in Claim 5 wherein the checksum 
is 16 bits. 

7. A switch as described in Claim 6 wherein each fabric 
has an aggregator which receives the stripes from the port card, a 
memory controller in which the stripes are stored and a separator 
which sends the stripes back to the port card. 

8. A method for switching packets comprising the steps of:

receiving packets at a port card from a network of a switch;

sending to fabrics of the switch portions of the packets 
as stripes from the port card; 

switching the portions of the packets with the fabrics;

sending back to the port card the portions of the packets 
as stripes from the fabrics; and 

sending correct packets with the port card to the network 
even though one of the fabrics has a failure. 

9. A method as described in Claim 8 wherein the sending 
to fabrics of the switch step includes the step of sending to n 
respective fabrics n stripes of portions of the packets, where n is 
greater than or equal to 2 and is an integer, and where one of the 
fabrics is a parity stripe having parity data concerning the packet 
to a parity fabric. 

10. A method as described in Claim 9 wherein before the 
sending the n stripes step, there is the step of calculating a 
check sum of the packet with a striper and adding it to the packet 
before it is striped. 

11. A method as described in Claim 10 wherein the 
sending back to the port card step includes the step of receiving 
at an unstriper of the port card the stripes and parity stripe from 
the fabrics, calculating with the unstriper the parity data from 
the stripes received, and comparing the parity data received from 
the parity stripe with the parity data calculated by the unstriper 
to determine if one of the fabrics has failed. 

12. A method as described in Claim 11 including after 
the comparing step, there is the step of calculating with the 
unstriper the check sum, replacing the data from each fabric in
turn, comparing the calculated check sum for each fabric to the 
check sum received with the packet calculated before the packet is 
striped, identifying which fabric has failed, and recovering the 
stripe from the fabric that has failed from the other stripes. 

13. A method as described in Claim 12 wherein the check 
sum is 16 bits. 

14. A method as described in Claim 13 wherein the 
switching step includes the step of receiving the portions of the 
packets as stripes at an aggregator of the fabric, and storing the 
portions of the packets in a memory controller of the fabric. 

15. A method as described in Claim 14 wherein the 
sending back to the port card step includes the step of sending 
with a separator of the fabric the portions of packets in the 
memory controller as stripes back to the unstriper of the port 
card. 

16. A method for switching packets comprising the steps of:

receiving packets at a port card from a network of a switch;

sending to fabrics of the switch portions of the packets 
as stripes from the port card; 

switching the portions of the packets with the fabrics; 

sending back to the port card the portions of the packets
as stripes from the fabrics; 

determining one of the fabrics has a failure; and 

determining which one of the fabrics has the failure. 

ABSTRACT OF THE DISCLOSURE

RECEIVER DECODING ALGORITHM TO ALLOW HITLESS 
N+1 REDUNDANCY IN A SWITCH

A switch for switching packets. The switch includes a 
plurality of fabrics which switch portions of packets. The switch 
includes a port card connected to the fabrics and the network for 
receiving packets from and sending packets to the network. The 
port card has a mechanism for tolerating whether any one of the 
plurality of fabrics has a failure and still sending correct 
packets to the network. A method for switching packets. The 
method includes the steps of receiving packets at a port card from 
a network of a switch. Then there is the step of sending to 
fabrics of the switch portions of the packets as stripes from the 
port card. Next there is the step of switching the portions of the 
packets with the fabrics. Then there is the step of sending back 
to the port card the portions of the packets as stripes from the 
fabrics. Next there is the step of sending correct packets with 
the port card to the network even though one of the fabrics has a 
failure.



